VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

Jiangyuan Guo, Wei Chen, Yuxuan Sun, Jialong Xu, Bo Ai
School of Electronic and Information Engineering
Beijing Jiaotong University
Beijing, China
{jiangyuanguo, weich, yxsun, jialongxu, boai}@bjtu.edu.cn
Abstract

Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speech, and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction offer higher bandwidth efficiency and real-time performance for various intelligent tasks. The difficulty in such system design lies in the extraction of task-related compact semantic representations and their accurate delivery over noisy channels. In this paper, we propose an end-to-end SC system for video question answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver under a wide range of channel conditions and bandwidth constraints. In particular, when the signal-to-noise ratio is low, VideoQA-SC can improve the answer accuracy by 5.17% while saving almost 99.5% of the bandwidth at the same time, compared with the advanced DJSCC-based SC system. Our results show the great potential of task-oriented SC system design for video applications.

Keywords Semantic communication · video question answering · DJSCC · bandwidth allocation · multimodal task

With the development of artificial intelligence (AI) technology, many edge machines deploy AI models to process information for intelligent tasks [1, 2], which support AI-empowered applications such as remote healthcare, autonomous driving, and the Internet of Things (IoT). Semantic communication (SC) is an emerging paradigm that aims to extract the crucial task-relevant information and deliver it accurately, ultimately completing the intelligent tasks [3]. Thanks to the effective semantic extraction by deep neural networks, SC can achieve higher data compression ratios and faster execution of intelligent tasks than traditional communication. As a result, SC is widely used in many intelligent applications that require low latency and high accuracy under limited bandwidth resources, e.g., IoT networks [4, 5], intelligently connected vehicle networks [6, 7] and smart factories [8].

In a typical SC system, the transceiver is designed as a semantic codec (semantic encoder/decoder) represented by a neural network [9, 10, 11]. The semantic encoder at the transmitter needs to remove data redundancy and extract the compact semantic representation based on the structural characteristics of the source data. The semantic decoder at the receiver aims to process received semantic information to obtain results according to the specific intelligent task.

For different source data modalities (speech, text, images, etc.), appropriate neural network architectures are essential for semantic codecs to achieve efficient SC. Long short-term memory (LSTM) networks and Transformers can be utilized to model the sequential information in text [12, 13], and convolutional neural networks can be utilized to extract local information from speech [9, 14, 15] and images [16, 10, 17]. Furthermore, deep joint source-channel coding (DJSCC) can be integrated with SC for end-to-end (E2E) training to resist wireless noise while improving the overall performance of SC [16, 10, 17, 18].

Unlike SC for text or images, video-based SC presents greater challenges due to the extra temporal correlations present in videos. Building on traditional video compression techniques, earlier studies [19, 11, 20, 21] break down video transmission into the sequential transmission of several frames using conditional coding. The current frame is modeled by its conditional distribution given the adjacent reference frames. Then, in actual transmission, frames are encoded and decoded sequentially based on the reference frames. Some studies [22, 23] segment frames into backgrounds and key points/segments for semantic extraction and transmission. However, these approaches lack spatiotemporal modeling of the whole video, leading to inefficient semantic extraction. Overall, current video-based SC research mainly focuses on video reconstruction, with few investigations and developments for other intelligent functionalities. Furthermore, multimodal video-based SC has yet to be extensively explored.

Video question answering (VideoQA), where machines automatically answer natural language questions with video contents, is an intelligent task in the popular visual-language understanding domain. Solving VideoQA tasks enables innovative applications in human-machine interactions such as virtual reality, smart cities, and the metaverse. The proliferation of multimedia applications and the extensive deployment of cameras have led to a significant presence of videos in machines, affecting both human-machine and machine-machine communications.

Compared with image-based visual question answering, VideoQA includes a broader range of question types. It involves not only recognition of visual objects, actions, and events, but also reasoning about spatiotemporal and causal relationships, making it more challenging [24]. The key to VideoQA is understanding video contents in light of the questions, which drives extensive research on how to effectively handle videos and questions [25, 26, 27, 28, 29, 30]. Some works jointly extract frame features and motion features to get effective video representations [25, 26, 27]. Generic backbones such as Vision Transformers are used to obtain general video representations in [28, 29]. Moreover, innovative loss functions and training methods are developed to align video and question features for multimodal fusion [26, 28, 29, 30]. However, the video features extracted by traditional VideoQA methods are usually high-dimensional, which may not meet bandwidth constraints in wireless networks. Channel fading and noise also affect the accurate transmission of video features, resulting in degradation of VideoQA performance.

Typically, the development of SC systems for VideoQA tasks encounters two key challenges:

  1. How to model the spatiotemporal correlations of videos to achieve efficient semantic extraction?

  2. How to mitigate the effects of wireless channel degradation and meet bandwidth limitations while maintaining effective VideoQA performance?

In this paper, we investigate an E2E multimodal SC system named VideoQA-SC for VideoQA tasks. The proposed VideoQA-SC mainly incorporates two customized modules to address the above challenges: a spatiotemporal video semantic encoder and a learning-based bandwidth-adaptive joint source-channel (JSC) encoder/decoder. Experimental results demonstrate that the proposed VideoQA-SC achieves noise robustness and bandwidth efficiency.

The main contributions of this paper are summarized as follows:

  1. An E2E SC System for VideoQA Tasks: We propose an E2E multimodal SC system called VideoQA-SC for VideoQA tasks. VideoQA-SC exploits efficient video semantic extraction and bandwidth-adaptive DJSCC transmission to fully leverage video information, achieving noise robustness and bandwidth efficiency with promising task performance.

  2. Spatiotemporal Semantic Encoder: We propose a spatiotemporal semantic encoder to extract compact and comprehensive video semantics for transmission. A Transformer and a graph neural network are utilized to model the temporal and spatial correlations of videos, which is beneficial for understanding video contents.

  3. Cross-Attention Based JSC Encoder/Decoder: We propose a dual-branch cross-attention Transformer structure as both the JSC encoder and decoder, with a learnable rate embedding shared between the transmitter and receiver. Through its cross-attention architecture, the structure allows for progressive refinement of the semantics at both the transmitter and receiver.

  4. Learning-Based Adaptive Bandwidth Allocation: We develop a series of learning-based rate predictors to allocate bandwidth to video semantics for transmission. The rate predictors can learn the importance of different tokens in the semantics, improving the bandwidth efficiency of SC systems. Moreover, the rate predictors allow other useful information, e.g., channel state information, to serve as additional guidance for bandwidth allocation, demonstrating good scalability for learning-based bandwidth allocation methods.

  5. Experimental Analysis: We verify the performance of VideoQA-SC on the TGIF-QA [31] dataset. Experiments demonstrate that VideoQA-SC outperforms traditional communication systems and other DJSCC-based SC systems under a wide range of channel conditions and bandwidth constraints. In particular, VideoQA-SC improves VideoQA accuracy by $5.17\%$ while achieving nearly $99.5\%$ bandwidth savings compared with the DJSCC-based SC system over the additive white Gaussian noise (AWGN) channel at $0$ dB signal-to-noise ratio (SNR).

Figure 1: An application scenario for VideoQA-SC.

The rest of this paper is organized as follows. The system model and the process for performing VideoQA tasks are introduced in Section 1. We explain our proposed methods and the detailed network architectures in Section 2. Section 3 provides the quantified experimental results and the comparison with existing advanced methods. Finally, Section 4 summarizes this paper and gives conclusions.

Notations: In this paper, lowercase letters, e.g., $x$, denote scalars. Bold lowercase letters, e.g., $\mathbf{x}$, denote vectors and bold uppercase letters, e.g., $\mathbf{X}$, denote matrices or tensors. $\mathbf{I}$ denotes the identity matrix. $\mathcal{CN}(\mu,\sigma^{2})$ and $\mathcal{N}(\mu,\sigma^{2})$ denote the complex Gaussian distribution and the standard Gaussian distribution with mean $\mu$ and covariance $\sigma^{2}$, respectively. $\log_{2}$ denotes the logarithm to base $2$ and $\log$ denotes the natural logarithm. $(\cdot)^{T}$ denotes the transpose and $(\cdot)^{*}$ denotes the conjugate transpose. $\mathbb{R}$ and $\mathbb{C}$ denote the real set and the complex set, respectively. $\mathbb{E}[\cdot]$ denotes the statistical expectation operation. $\operatorname{Uniform}(a,b)$ denotes the uniform distribution with start $a$ and end $b$.

1 System Model

In this section, we introduce the VideoQA-SC workflow to perform VideoQA tasks and establish an optimization model for the entire system with bandwidth constraints.

Fig. 1 shows an application scenario for VideoQA-SC. There are many terminal devices simultaneously requesting access to the same surveillance videos with different questions. The transmitter, e.g., edge server, extracts video semantics containing comprehensive video contents and sends them to all terminal devices. Then, each terminal device independently completes VideoQA to predict its own answer.

Our work focuses on the multi-choice VideoQA tasks. Given the video $\mathbf{X}_{v}\in\mathbb{R}^{l_{v}\times 3\times x\times y}$ with $l_{v}$ frames and the question $\mathbf{X}_{q}\in\mathbb{R}^{l_{q}\times d_{q}}$ with $l_{q}$ tokens, VideoQA aims to predict an answer $a$ by exploiting both video and text information:

$$a^{\star}=\mathop{\arg\max}\limits_{a\in\mathcal{A}} o_{\boldsymbol{\omega}}(a\,|\,\mathbf{X}_{q},\mathbf{X}_{v},\mathcal{A}), \tag{1}$$

where $a^{\star}$ is the predicted answer chosen from the candidate answers, i.e., multiple choices or a predefined global answer set, denoted as $\mathcal{A}$, and $o_{\boldsymbol{\omega}}(\cdot)$ is the VideoQA model with the learnable vector $\boldsymbol{\omega}$.

Figure 2: Overview of the proposed VideoQA-SC.

As illustrated in Fig. 2, the whole process of VideoQA-SC mainly includes 3 parts:

1.1 Transmitter

The transmitter first extracts the low-dimensional semantics $\mathbf{Y}_{v}\in\mathbb{R}^{l_{v}\times d}$ from the input video $\mathbf{X}_{v}$ using the spatiotemporal semantic encoder $g(\cdot;\boldsymbol{\zeta})$ with the learnable parameter vector $\boldsymbol{\zeta}$. Then, the JSC encoder with rate predictors $f_{e}(\cdot;\boldsymbol{\theta},\boldsymbol{\epsilon})$ processes $\mathbf{Y}_{v}$ into $\mathbf{S}_{v}\in\mathbb{R}^{l_{v}\times d}$, some of whose channels are masked to zero. $\boldsymbol{\theta}$ and $\boldsymbol{\epsilon}$ are the learnable parameter vectors of the JSC encoder and rate predictors, respectively. Non-zero channels of $\mathbf{S}_{v}$ are flattened into continuous-valued real symbols. Finally, complex channel input symbols $\mathbf{s}_{v}^{c}\in\mathbb{C}^{n}$ are obtained by converting every two real symbols into one complex symbol. The processing of $\mathbf{X}_{v}$ at the transmitter can be expressed as:

$$\mathbf{s}_{v}^{c}=\operatorname{R2C}(\operatorname{Flatten}(f_{e}(g(\mathbf{X}_{v};\boldsymbol{\zeta});\boldsymbol{\theta},\boldsymbol{\epsilon}))), \tag{2}$$

where $\operatorname{R2C}(\cdot)$ and $\operatorname{Flatten}(\cdot)$ are the real-to-complex and flattening operations, respectively. $\frac{1}{n}\mathbf{s}_{v}^{c}(\mathbf{s}_{v}^{c})^{*}\leq 1$ is imposed to satisfy the average power constraint at the transmitter.

Here, we define $R=\frac{n}{l_{v}\times 3\times x\times y}\leq 1$ as the bandwidth compression ratio (BCR), which represents the average length of channel input symbols encoded for each source symbol.
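The symbol mapping in Eq. (2) and the BCR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the boolean `mask`, the zero-padding of an odd symbol count, the normalization to exactly unit power (which satisfies the "at most 1" constraint), and the 224 x 224 frame size are all assumptions made for illustration.

```python
import numpy as np

def r2c(real_symbols):
    """R2C: pair consecutive real symbols into complex symbols."""
    real_symbols = np.asarray(real_symbols, dtype=np.float64)
    if real_symbols.size % 2:  # assumption: pad one zero if the count is odd
        real_symbols = np.append(real_symbols, 0.0)
    return real_symbols[0::2] + 1j * real_symbols[1::2]

def normalize_power(s):
    """Scale s so that its average power (1/n) * sum |s_k|^2 equals 1."""
    power = np.mean(np.abs(s) ** 2)
    return s / np.sqrt(power) if power > 0 else s

def transmit_symbols(S_v, mask):
    """Flatten the unmasked entries of S_v (row-major) and map them to complex symbols."""
    kept = S_v[mask]  # boolean indexing returns a flat array of kept channels
    return normalize_power(r2c(kept))

def bcr(n, l_v, x, y):
    """Bandwidth compression ratio R = n / (l_v * 3 * x * y)."""
    return n / (l_v * 3 * x * y)

# Hypothetical sizes: 4 tokens with 8 channels, keeping 8, 4, 2 and 6 channels per token.
rng = np.random.default_rng(0)
S_v = rng.standard_normal((4, 8))
keep = np.array([8, 4, 2, 6])
mask = np.arange(8)[None, :] < keep[:, None]
s_c = transmit_symbols(S_v, mask)
R = bcr(s_c.size, l_v=4, x=224, y=224)
```

Here 20 real symbols survive the mask, so `s_c` holds 10 complex symbols.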

1.2 Channel

The encoded channel input symbols $\mathbf{s}_{v}^{c}$ are transmitted over the noisy wireless channel. For the AWGN channel, the received symbols can be expressed as:

$$\mathbf{\hat{s}}_{v}^{c}=\mathbf{s}_{v}^{c}+\mathbf{n}, \tag{3}$$

where $\mathbf{n}\in\mathbb{C}^{n}$ consists of independent and identically distributed (i.i.d.) samples that follow $\mathcal{CN}(0,\sigma^{2}_{n}\mathbf{I})$. $\sigma^{2}_{n}$ denotes the average noise power. For block fading channels, an additional channel gain $h\in\mathbb{C}^{1}$ is introduced for each $\mathbf{s}_{v}^{c}$:

$$\mathbf{\hat{s}}_{v}^{c}=h\mathbf{s}_{v}^{c}+\mathbf{n}. \tag{4}$$
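The two channel models in Eqs. (3) and (4) can be simulated with the following sketch. It assumes unit average transmit power, so that $\sigma_{n}^{2}=10^{-\mathrm{SNR}/10}$, and it assumes Rayleigh block fading with $h\sim\mathcal{CN}(0,1)$; the text above does not specify the fading distribution, so that choice is purely illustrative.

```python
import numpy as np

def awgn(s, snr_db, rng):
    """Eq. (3): add i.i.d. circularly symmetric complex Gaussian noise.

    Assumes unit average signal power, so sigma_n^2 = 10^(-snr_db / 10).
    """
    sigma2 = 10.0 ** (-snr_db / 10.0)
    noise = np.sqrt(sigma2 / 2.0) * (
        rng.standard_normal(s.shape) + 1j * rng.standard_normal(s.shape)
    )
    return s + noise

def block_fading(s, snr_db, rng):
    """Eq. (4): a single complex gain h multiplies the whole block.

    Assumption: Rayleigh fading, h ~ CN(0, 1).
    """
    h = np.sqrt(0.5) * (rng.standard_normal() + 1j * rng.standard_normal())
    return awgn(h * s, snr_db, rng), h

rng = np.random.default_rng(0)
s = np.ones(100_000, dtype=complex)   # unit-power test block
y_awgn = awgn(s, snr_db=0.0, rng=rng)
y_fade, h = block_fading(s, snr_db=100.0, rng=rng)
```

At 0 dB the empirical noise power of `y_awgn - s` is close to 1, matching the signal power.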

1.3 Receiver

The received complex channel output symbols $\mathbf{\hat{s}}_{v}^{c}$ are first converted to real symbols, which are further unflattened and padded with zeros to form $\mathbf{\hat{S}}_{v}\in\mathbb{R}^{l_{v}\times d}$. The receiver then decodes the video semantics $\mathbf{\hat{Y}}_{v}\in\mathbb{R}^{l_{v}\times d}$ from $\mathbf{\hat{S}}_{v}$ by the JSC decoder $f_{d}(\cdot;\boldsymbol{\phi})$ with the learnable parameter vector $\boldsymbol{\phi}$. Subsequently, the receiver utilizes the multimodal fuser $u(\cdot;\boldsymbol{\nu})$ with the learnable parameter vector $\boldsymbol{\nu}$ to interact with video and text information and predicts the corresponding answer $a^{\star}$ to the question $\mathbf{X}_{q}$. The process of answer prediction at the receiver can be expressed as:

$$a^{\star}=u(f_{d}(\operatorname{Pad}(\operatorname{Unflatten}(\operatorname{C2R}(\mathbf{\hat{s}}_{v}^{c})));\boldsymbol{\phi}),\mathbf{X}_{q};\boldsymbol{\nu}), \tag{5}$$

where $\operatorname{Pad}(\cdot)$, $\operatorname{Unflatten}(\cdot)$ and $\operatorname{C2R}(\cdot)$ denote zero-padding, unflattening and complex-to-real operations, respectively.
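The C2R, Unflatten and Pad steps in Eq. (5) can be sketched as follows. The sketch assumes that the receiver knows which channels were kept (the boolean `mask`); how this information reaches the receiver is not described at this point in the text, so that part is a hypothetical simplification.

```python
import numpy as np

def c2r(s_hat):
    """C2R: split each complex symbol into two consecutive real symbols."""
    out = np.empty(2 * s_hat.size, dtype=np.float64)
    out[0::2] = s_hat.real
    out[1::2] = s_hat.imag
    return out

def unflatten_and_pad(real_symbols, mask):
    """Put received real symbols back at the unmasked positions of an
    (l_v, d) array; masked positions are zero-padded."""
    S_hat = np.zeros(mask.shape, dtype=np.float64)
    n_kept = int(mask.sum())
    S_hat[mask] = real_symbols[:n_kept]  # drop a possible trailing pad symbol
    return S_hat

# Hypothetical round trip over a noiseless channel.
rng = np.random.default_rng(1)
mask = np.arange(8)[None, :] < np.array([8, 4, 2, 6])[:, None]
S_v = np.where(mask, rng.standard_normal((4, 8)), 0.0)
flat = S_v[mask]
s_c = flat[0::2] + 1j * flat[1::2]
S_hat = unflatten_and_pad(c2r(s_c), mask)
```

With no noise, `S_hat` reproduces `S_v` exactly, which confirms that the receiver-side operations invert the transmitter-side flattening and pairing.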

The goal of VideoQA-SC is to maximize the average accuracy of VideoQA on testing data for a given bandwidth $B$ by optimizing all learnable parameter vectors $\boldsymbol{\zeta}$, $\boldsymbol{\theta}$, $\boldsymbol{\epsilon}$, $\boldsymbol{\phi}$ and $\boldsymbol{\nu}$, which can be formulated as:

$$\begin{aligned}\max_{\boldsymbol{\zeta},\boldsymbol{\theta},\boldsymbol{\epsilon},\boldsymbol{\phi},\boldsymbol{\nu}}&\quad\mathbb{E}[\operatorname{ACC}(a,a_{\text{label}})]\\ \mathrm{s.t.}&\quad R\leq B,\end{aligned} \tag{6}$$

where $\operatorname{ACC}(\cdot,\cdot)$ is an indicator function ($1$ only if $a=a_{\text{label}}$ and $0$ otherwise).
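As a small illustration of the answer selection in Eq. (1) and the objective in Eq. (6), the sketch below picks the highest-scoring candidate and averages the indicator $\operatorname{ACC}$ over a batch. The score matrix and labels are made-up numbers, not results from the paper.

```python
import numpy as np

def predict_answer(scores):
    """Eq. (1): index of the highest-scoring candidate answer."""
    return int(np.argmax(scores))

def accuracy(predictions, labels):
    """Empirical mean of the indicator ACC(a, a_label) from Eq. (6)."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return float(np.mean(predictions == labels))

# Hypothetical scores for 4 questions with 3 candidate answers each.
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],
])
preds = [predict_answer(row) for row in scores]
acc = accuracy(preds, [1, 0, 1, 0])
```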

Figure 3: Temporal and spatial modeling of the $i$-th video clip. Shapes with the same color represent the same objects in different frames.

2 The Proposed Method

In this section, we describe in detail the proposed VideoQA-SC system, including the spatiotemporal semantic encoder, cross-attention based JSC encoder/decoder and learning-based adaptive bandwidth allocation. Then, we introduce the training strategy of VideoQA-SC.

2.1 Spatiotemporal Semantic Encoder

We develop a spatiotemporal semantic encoder $g(\cdot;\boldsymbol{\zeta})$ to extract the video semantics by modeling the spatial and temporal correlations of the video. The purpose of $g(\cdot;\boldsymbol{\zeta})$ is to extract coarse-grained semantics that support a full understanding of the video content. In this way, although there may be multiple receivers with different inquiries, they are able to discern the video content pertinent to their specific questions through the processed video semantics, enabling them to perform their own analysis without video recovery.

Given that consecutive frames in a video typically have identical backgrounds, substantial spatial and temporal redundancies exist, necessitating removal to enhance semantic extraction. Inspired by VideoQA works that extracted frame features and motion features to get video representations [25, 26, 27], the proposed spatiotemporal semantic encoder $g(\cdot;\boldsymbol{\zeta})$ operates mainly on object-level and frame-level features to capture the changes of visual objects while removing redundancy, thereby extracting compact and comprehensive semantic representations.

To reduce the use of computational and transmission resources, we initially apply uniform interval sparse sampling across each video, choosing $l_{v}$ frames to form the keyframes that make up $\mathbf{X}_{v}$. The $l_{v}$ keyframes are divided into $l_{c}$ clips, each with a length of $l_{f}$ frames. For simplicity, we assume that each clip operates independently, and $g(\cdot;\boldsymbol{\zeta})$ is designed to only capture the correlations within the frames of a single clip and produce semantics for the clip.
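The sampling and clip division just described can be sketched in a few lines. The rounding rule used to place the keyframes and the example sizes (a 64-frame video, 16 keyframes, 4 frames per clip) are assumptions for illustration; the text only states that sampling is at uniform intervals.

```python
import numpy as np

def sample_keyframes(num_frames, l_v):
    """Pick l_v frame indices at approximately uniform intervals."""
    return np.linspace(0, num_frames - 1, num=l_v).round().astype(int)

def split_into_clips(keyframe_idx, l_f):
    """Group l_v keyframes into l_c = l_v / l_f clips of l_f frames each."""
    if len(keyframe_idx) % l_f != 0:
        raise ValueError("l_v must be a multiple of l_f")
    return keyframe_idx.reshape(-1, l_f)

# Hypothetical video with 64 frames, l_v = 16 keyframes, l_f = 4 frames per clip.
idx = sample_keyframes(num_frames=64, l_v=16)
clips = split_into_clips(idx, l_f=4)   # shape (l_c, l_f) = (4, 4)
```

Each row of `clips` lists the frame indices of one clip, which the encoder would then process independently of the other clips.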

The process of temporal and spatial modeling is illustrated in Fig. 3. For the $i$-th clip $\mathcal{X}_{i}=\left\{\mathbf{X}_{i,1},\mathbf{X}_{i,2},\cdots,\mathbf{X}_{i,l_{f}}\right\}$, $\mathbf{X}_{i,j}\in\mathbb{R}^{3\times x\times y}$ represents the $j$-th frame in the clip. We first use a pre-trained object detector and ResNet to process all $\mathbf{X}_{i,j}$ $(j\in\left[1,l_{f}\right])$, obtaining the object-level feature $\mathbf{O}_{i,j}\in\mathbb{R}^{r\times m}$ and frame-level feature $\mathbf{F}_{i,j}\in\mathbb{R}^{m}$, respectively. Let $r$ be the number of detected objects in $\mathbf{X}_{i,j}$ and $m$ be the channel dimension of both the object-level features and the frame-level features. We concatenate all $\mathbf{O}_{i,j}$ and $\mathbf{F}_{i,j}$ to get the $i$-th clip object-level feature $\mathbf{O}_{i}\in\mathbb{R}^{l_{f}\times r\times m}$ and the $i$-th clip frame-level feature $\mathbf{F}_{i}\in\mathbb{R}^{l_{f}\times m}$, respectively.

Then, each $\mathbf{O}_{i}$ is fed into stacked Transformer blocks to facilitate the interaction of the same object features in different frames in the $i$-th clip. Here, the number of frames in the clip corresponds to the sequence length in the original Transformer. By using the self-attention mechanism, we can obtain an aggregated representation of each object $\mathbf{O}_{i,j,k}\in\mathbb{R}^{m}$ $(k\in\left[1,r\right])$ within one clip to learn the object-level temporal correlations. The aggregation can be expressed as:

$$\sum_{p=1}^{l_{f}}\alpha_{i,k,j,p}\mathbf{O}_{i,p,k}\to\mathbf{O}_{i,j,k}, \tag{7}$$

where $\alpha_{i,k,j,p}$ is the attention score of aggregated $\mathbf{O}_{i,j,k}$ to $\mathbf{O}_{i,p,k}$.
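The aggregation in Eq. (7) can be illustrated with a single-head attention sketch in NumPy. It is a simplification under stated assumptions: the projection matrices `W_q` and `W_k` are hypothetical, object index $k$ is assumed to refer to the same object in every frame, and the value projection, residual connections and feed-forward layers of a full Transformer block are omitted so that the output matches the weighted sum in Eq. (7) term by term.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_aggregate(O_i, W_q, W_k):
    """O_i has shape (l_f, r, m). Returns aggregated features of the same
    shape and the scores alpha, where alpha[k, j, p] is the weight that
    frame j of object k places on frame p of the same object."""
    m = O_i.shape[-1]
    X = O_i.transpose(1, 0, 2)                # (r, l_f, m): one frame sequence per object
    Q, K = X @ W_q, X @ W_k                   # (r, l_f, m) each
    alpha = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(m))  # (r, l_f, l_f)
    out = alpha @ X                           # sum_p alpha[k, j, p] * O_i[p, k]
    return out.transpose(1, 0, 2), alpha

# Hypothetical sizes: l_f = 4 frames, r = 3 objects, m = 8 channels.
rng = np.random.default_rng(0)
l_f, r, m = 4, 3, 8
O_i = rng.standard_normal((l_f, r, m))
W_q = 0.1 * rng.standard_normal((m, m))
W_k = 0.1 * rng.standard_normal((m, m))
O_agg, alpha = temporal_aggregate(O_i, W_q, W_k)
```

Each row `alpha[k, j, :]` sums to one, so every aggregated object feature is a convex combination of that object's features across the frames of the clip.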

Subsequently, following the work [27], we construct a graph $\mathcal{G}_{i}$ for every $\mathbf{O}_{i,j}$, where each object in $\mathbf{O}_{i,j}$ is a node in $\mathcal{G}_{i}$. Based on the constructed graph, we perform graph convolution operations on every $\mathbf{O}_{i,j}$, exploiting the structural information between different objects in one frame to utilize spatial interaction. Finally, after average pooling along the object dimension, the processed object-level feature $\mathbf{O}^{\prime}_{i}\in\mathbb{R}^{l_{f}\times m}$ is concatenated with the frame-level feature $\mathbf{F}_{i}$. The concatenated feature is fed into a linear layer to map to the video semantics $\mathbf{Y}_{v,i}\in\mathbb{R}^{l_{f}\times d}$. All clip-level representations $\mathbf{Y}_{v,i}$ are concatenated to obtain the entire video semantics $\mathbf{Y}_{v}\in\mathbb{R}^{l_{v}\times d}$.

Figure 4: The structure of the dual-branch cross-attention Transformer block in the JSC encoder/decoder.

2.2 Cross-Attention Based DJSCC Transmission

We apply DJSCC to VideoQA-SC and design a symmetric JSC encoder/decoder to overcome wireless channel degradation, enabling accurate transmission of the video semantics $\mathbf{Y}_v$. We develop a learnable embedding shared between the transmitter and receiver, which progressively refines $\mathbf{Y}_v$ through cross-attention during the JSC encoding and decoding processes.

We use the Transformer as the backbone of both the JSC encoder and decoder. In the Transformer architecture, the self-attention mechanism uses the learnable matrix $\mathbf{W}^{QKV}$ to generate the query ($\mathbf{Q}$), key ($\mathbf{K}$), and value ($\mathbf{V}$) representations from embeddings. This process enables weighted information aggregation, effectively capturing dependencies between different embeddings in the sequence. In DJSCC, code rate guidance allows the refinement of the latent representations, thereby generating channel input symbols adapted to bandwidth constraints, which motivates us to provide code rate guidance to the encoding and decoding processes of video semantics.

Instead of directly using a human-defined code rate, we design the code rate as learnable parameters that interact with the video semantics through cross-attention. Specifically, we introduce a learnable parameter tensor with the same shape as $\mathbf{Y}_v$, termed the rate embedding $\mathbf{Y}_{\text{rate}}$, in the JSC encoder $f_e(\cdot;\boldsymbol{\theta},\boldsymbol{\epsilon})$ and the JSC decoder $f_d(\cdot;\boldsymbol{\phi})$. The encoding and decoding processes of $\mathbf{Y}_v$ are both guided by $\mathbf{Y}_{\text{rate}}$, which is shared between the transmitter and receiver. Furthermore, $\mathbf{Y}_{\text{rate}}$ can aid the JSC encoder in achieving variable-length coding of $\mathbf{Y}_v$, which will be explained in detail in Section 2.3.

Since the JSC encoder and decoder have similar network structures, we take the JSC encoder as an example to describe the interaction between $\mathbf{Y}_{\text{rate}}$ and $\mathbf{Y}_v$. As illustrated in Fig. 4, the proposed cross-attention Transformer block consists of two symmetric branches (rate branch and feature branch) that process $\mathbf{Y}_{\text{rate}}$ and $\mathbf{Y}_v$, respectively. Each branch is a standard Transformer block in Vision Transformer.

Starting from the projection of the two embeddings, $\mathbf{Y}_{\text{rate}} \in \mathbb{R}^{l_v \times d}$ is transformed into $\mathbf{Q}_{\text{rate}} \in \mathbb{R}^{l_v \times d}$, $\mathbf{K}_{\text{rate}} \in \mathbb{R}^{l_v \times d}$ and $\mathbf{V}_{\text{rate}} \in \mathbb{R}^{l_v \times d}$ by $\mathbf{W}^{QKV}_{\text{rate}}$, and $\mathbf{Y}_v \in \mathbb{R}^{l_v \times d}$ is transformed into $\mathbf{Q}_v \in \mathbb{R}^{l_v \times d}$, $\mathbf{K}_v \in \mathbb{R}^{l_v \times d}$ and $\mathbf{V}_v \in \mathbb{R}^{l_v \times d}$ by $\mathbf{W}^{QKV}_v$. The proposed cross-attention mechanism uses the scaled dot-product of $\mathbf{Q}_{\text{rate}}$ and $\mathbf{K}_v$ to generate attention for the rate branch, and the scaled dot-product of $\mathbf{Q}_v$ and $\mathbf{K}_{\text{rate}}$ to generate attention for the feature branch. The cross-attention from $\mathbf{Y}_{\text{rate}}$ to $\mathbf{Y}_v$ and from $\mathbf{Y}_v$ to $\mathbf{Y}_{\text{rate}}$ can be formulated as:

$$\operatorname{CA}(\mathbf{Y}_{\text{rate}},\mathbf{Y}_v)=\operatorname{softmax}\left(\frac{\mathbf{Q}_{\text{rate}}\mathbf{K}_v^{T}}{\sqrt{d}}\right)\mathbf{V}_v, \tag{8}$$

and

$$\operatorname{CA}(\mathbf{Y}_v,\mathbf{Y}_{\text{rate}})=\operatorname{softmax}\left(\frac{\mathbf{Q}_v\mathbf{K}_{\text{rate}}^{T}}{\sqrt{d}}\right)\mathbf{V}_{\text{rate}}, \tag{9}$$

respectively.
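As a minimal sketch of Eqs. (8) and (9), the snippet below computes single-head cross-attention on toy matrices in plain Python. The helper names (`matmul`, `softmax_rows`, `cross_attention`) and all numeric values are illustrative assumptions, not part of the proposed system; the projections $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are assumed to be already computed.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V with queries from one branch, keys/values from the other."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)

# Toy projections with l_v = 2 tokens and d = 2 channels (illustrative values only).
Q_rate = [[1.0, 0.0], [0.0, 1.0]]
K_rate = [[0.5, 0.5], [1.0, -1.0]]
V_rate = [[1.0, 2.0], [3.0, 4.0]]
Q_v = [[0.0, 1.0], [1.0, 1.0]]
K_v = [[1.0, 0.0], [0.0, 2.0]]
V_v = [[5.0, 6.0], [7.0, 8.0]]

ca_rate = cross_attention(Q_rate, K_v, V_v)   # Eq. (8): rate branch attends to video semantics
ca_v = cross_attention(Q_v, K_rate, V_rate)   # Eq. (9): feature branch attends to rate embedding
```

Each output row is a convex combination of the value rows of the other branch, which is how one embedding rescales the other.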

Then, the process of the entire Transformer block for the rate branch and the feature branch can be formulated as:

$$\begin{aligned} \tilde{\mathbf{Y}}_{\text{rate}}^{l} &= \operatorname{MHCA}(\operatorname{LN}(\mathbf{Y}_{\text{rate}}),\operatorname{LN}(\mathbf{Y}_v)) + \mathbf{Y}_{\text{rate}}^{l-1}, \\ \mathbf{Y}_{\text{rate}}^{l} &= \operatorname{FFN}(\operatorname{LN}(\tilde{\mathbf{Y}}_{\text{rate}}^{l})) + \operatorname{LN}(\tilde{\mathbf{Y}}_{\text{rate}}^{l}), \end{aligned} \tag{10}$$

and

$$\begin{aligned} \tilde{\mathbf{Y}}_v^{l} &= \operatorname{MHCA}(\operatorname{LN}(\mathbf{Y}_v),\operatorname{LN}(\mathbf{Y}_{\text{rate}})) + \mathbf{Y}_v^{l-1}, \\ \mathbf{Y}_v^{l} &= \operatorname{FFN}(\operatorname{LN}(\tilde{\mathbf{Y}}_v^{l})) + \operatorname{LN}(\tilde{\mathbf{Y}}_v^{l}), \end{aligned} \tag{11}$$

respectively. In Eqs. (10) and (11), $\mathbf{Y}_{\text{rate}}^{l-1}$ and $\mathbf{Y}_v^{l-1}$ denote the inputs of the $l$-th Transformer block of the two branches, and $\mathbf{Y}_{\text{rate}}^{l}$ and $\mathbf{Y}_v^{l}$ denote the outputs of the $l$-th Transformer block of the two branches. $\operatorname{MHCA}(\cdot)$ represents the multi-head version of the $\operatorname{CA}(\cdot)$ function. $\operatorname{LN}(\cdot)$ represents the layer normalization in the Transformer. $\operatorname{FFN}(\cdot)$ represents two linear layers with $\operatorname{GeLU}(\cdot)$ as the activation function.
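The block in Eqs. (10) and (11) can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: a single attention head instead of MHCA, layer normalization without learnable affine parameters, bias-free linear layers, an FFN hidden width of $2d$, and random weights. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(a, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance (no learnable affine)."""
    out = []
    for row in a:
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        out.append([(x - mu) / math.sqrt(var + eps) for x in row])
    return out

def softmax_rows(a):
    """Row-wise softmax with a max-shift for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Single-head scaled dot-product attention (simplification of MHCA)."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def gelu(x):
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(x, params):
    """Two bias-free linear layers with GELU in between."""
    hidden = [[gelu(h) for h in row] for row in matmul(x, params["w1"])]
    return matmul(hidden, params["w2"])

def make_branch(d, rng):
    """Hypothetical per-branch parameters: Q/K/V projections and FFN weights."""
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"wq": rand(d, d), "wk": rand(d, d), "wv": rand(d, d),
            "w1": rand(d, 2 * d), "w2": rand(2 * d, d)}

def dual_branch_block(y_rate, y_v, p_rate, p_v):
    """One dual-branch cross-attention block; returns updated (Y_rate, Y_v)."""
    n_rate, n_v = layer_norm(y_rate), layer_norm(y_v)
    q_r, k_r, v_r = (matmul(n_rate, p_rate[w]) for w in ("wq", "wk", "wv"))
    q_v, k_v, v_v = (matmul(n_v, p_v[w]) for w in ("wq", "wk", "wv"))
    # First line of Eqs. (10)/(11): cross-attention plus residual on the block input.
    t_rate = add(attention(q_r, k_v, v_v), y_rate)
    t_v = add(attention(q_v, k_r, v_r), y_v)
    # Second line as printed: the residual is taken on the normalized tensor.
    ln_rate, ln_v = layer_norm(t_rate), layer_norm(t_v)
    return add(ffn(ln_rate, p_rate), ln_rate), add(ffn(ln_v, p_v), ln_v)

rng = random.Random(0)
l_v, d = 3, 4
y_rate = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(l_v)]
y_v = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(l_v)]
blocks = [(make_branch(d, rng), make_branch(d, rng)) for _ in range(2)]
for p_rate, p_v in blocks:  # stack two blocks; both embeddings are updated together
    y_rate, y_v = dual_branch_block(y_rate, y_v, p_rate, p_v)
```

Both embeddings keep the shape $l_v \times d$ through every block, so blocks can be stacked and the rate embedding produced by one block guides the next.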

The proposed symmetric dual-branch cross-attention Transformer block allows the two types of embedding ($\mathbf{Y}_{\text{rate}}$ and $\mathbf{Y}_v$) to exchange information through cross-attention, thereby promoting information flow across both branches. As $\mathbf{Y}_v^{l-1}$ is updated to $\mathbf{Y}_v^{l}$, $\mathbf{Y}_{\text{rate}}^{l-1}$ is also updated to $\mathbf{Y}_{\text{rate}}^{l}$, which provides dynamic rate guidance to scale each feature in $\mathbf{Y}_v^{l}$ in the next Transformer block. During the interaction of the two branches, $\mathbf{Y}_{\text{rate}}$ and $\mathbf{Y}_v$ refine each other iteratively and finally contribute to the generation of the real symbols $\mathbf{S}_v$.

After flattening and the real-to-complex operation, the channel input symbols $\mathbf{s}_v^{c}$ are transmitted through the noisy channel. The JSC decoder $f_d(\cdot;\boldsymbol{\phi})$ has the same structure as the JSC encoder, consisting of stacked dual-branch cross-attention Transformer blocks. The JSC decoder exploits the same rate embedding $\mathbf{Y}_{\text{rate}}$ and progressively decodes the video semantics $\hat{\mathbf{Y}}_v$ from the noisy channel output symbols $\hat{\mathbf{s}}_v$ at the receiver.
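For intuition, the flattening and real-to-complex steps can be sketched as below. The text does not specify the pairing convention, the power constraint, or the channel model at this point, so the choices here (consecutive reals paired as real and imaginary parts, unit average power, an AWGN channel) are assumptions made purely for illustration, and all function names are hypothetical.

```python
import math
import random

def real_to_complex(s_real):
    """Flatten a real matrix and pair consecutive entries as (real, imag) parts."""
    flat = [x for row in s_real for x in row]
    if len(flat) % 2:
        flat.append(0.0)  # pad so that every symbol has an imaginary part
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]

def normalize_power(symbols):
    """Scale the symbols to unit average power (assumed power constraint)."""
    power = sum(abs(s) ** 2 for s in symbols) / len(symbols)
    scale = 1.0 / math.sqrt(power)
    return [s * scale for s in symbols]

def awgn(symbols, snr_db, rng):
    """Add circularly symmetric complex Gaussian noise, assuming unit-power input."""
    noise_var = 10 ** (-snr_db / 10)
    sigma = math.sqrt(noise_var / 2)  # per real dimension
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in symbols]

# Toy real symbols S_v with 2 tokens and 3 retained channels each.
s_v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
s_c = normalize_power(real_to_complex(s_v))
s_hat = awgn(s_c, snr_db=10.0, rng=random.Random(0))
```

Under this pairing, $n$ real values occupy $\lceil n/2 \rceil$ complex channel uses, so any reduction in retained channels translates directly into fewer channel uses.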

2.3 Learning-Based Adaptive Bandwidth Allocation

We propose a learning-based adaptive bandwidth allocation approach to generate channel input symbols of variable lengths, further improving the bandwidth efficiency of VideoQA-SC.

For the full use of limited bandwidth resources, flexible bandwidth allocation is required, e.g., more bandwidth for important information and less bandwidth for less important information. Statistical-based methods and learning-based methods can both be employed to measure the importance of features. Statistical-based methods, such as the feature entropy estimation [11], explicitly model the importance of features based on their entropy. Learning-based methods, e.g., distinguishing informative features and uninformative features with scaling factors in Batch Normalization layers [32], implicitly model the importance of features.

Figure 5: The JSC encoder with adaptive bandwidth allocation. CA Transformer block denotes the proposed dual-branch cross-attention Transformer block in Section 2.2.

According to the cross-attention mechanism, $\mathbf{Y}_{\text{rate}}$ described in Section 2.2 can be seen as a score metric that dynamically scales the elements of $\mathbf{Y}_v$, which motivates us to measure the importance of features based on $\mathbf{Y}_{\text{rate}}$. Our learning-based adaptive bandwidth allocation approach exploits $\mathbf{Y}_{\text{rate}}$ as guidance to sparsify the channels of each token output by the JSC encoder, generating channel input symbols of variable lengths. Specifically, given the outputs of the last dual-branch Transformer block, $\mathbf{Y}_{\text{rate}} \in \mathbb{R}^{l_v \times d}$ and $\mathbf{Y}_v \in \mathbb{R}^{l_v \times d}$, we develop a series of rate predictors parameterized by $\boldsymbol{\epsilon}$ to predict the retained dimension for each token $\mathbf{Y}_{v,i} \in \mathbb{R}^{d}$ from $\mathbf{Y}_{\text{rate}}$.
Then, a binary mask matrix $\mathbf{M} \in \mathbb{R}^{l_v \times d}$ is generated for channel masking, with the $i$-th row containing $k_i$ ones followed by $(d-k_i)$ zeros ($k_i \in [0,d]$). $\mathbf{M}$ is used to retain the first $k_i$ channels of $\mathbf{Y}_{v,i}$ and mask the remaining channels:

$$\mathbf{Y}_v \odot \mathbf{M} \to \mathbf{Y}_v, \tag{12}$$

where $\odot$ is the Hadamard product.
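The masking in Eq. (12) can be illustrated with a short sketch. The function names are hypothetical, and the last helper, which drops the masked channels to form a variable-length symbol sequence, is an assumption about how the masked output is serialized rather than a detail stated in the text.

```python
def build_mask(k, d):
    """Binary mask M: row i holds k[i] ones followed by d - k[i] zeros."""
    return [[1 if c < k_i else 0 for c in range(d)] for k_i in k]

def apply_mask(y_v, mask):
    """Hadamard product Y_v * M, as in Eq. (12)."""
    return [[y * m for y, m in zip(row_y, row_m)] for row_y, row_m in zip(y_v, mask)]

def retained_symbols(y_v, k):
    """Keep only the first k[i] channels of token i (assumed serialization)."""
    return [y for row, k_i in zip(y_v, k) for y in row[:k_i]]

# Toy example: l_v = 3 tokens, d = 4 channels, retained dimensions k = (2, 4, 1).
y_v = [[0.5, -1.0, 2.0, 3.0],
       [1.5, 0.1, -0.2, 0.7],
       [-0.3, 0.9, 1.1, -2.0]]
k = [2, 4, 1]
mask = build_mask(k, d=4)
masked = apply_mask(y_v, mask)
symbols = retained_symbols(y_v, k)  # total length is sum(k) = 7
```

The number of transmitted real values is $\sum_i k_i$, which is the quantity the rate predictors control.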

To facilitate the learning of bandwidth allocation by neural networks, we set up $q=\log_2 d$ fixed candidate bandwidths for each token. In other words, for each $\mathbf{Y}_{v,i}$, $k_i$ is selected from these $q$ fixed values ($k_i \in \{2,4,8,\cdots,d\}$). Therefore, the channel sparsification problem can be seen as a classification problem of selecting the most suitable of $q$ categories for each token $\mathbf{Y}_{v,i}$. However, the output of the rate predictors is a probability distribution over the $q$ categories, from which a specific class must be sampled for each token. In E2E training, this sampling operation is non-differentiable, making it impossible to update the parameters of the rate predictors through gradient descent. To overcome this problem, we employ the classical Gumbel-Softmax [33] trick to implement differentiable sampling. Next, we elaborate on the process of channel masking.
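As a small illustration of the candidate set, the sketch below enumerates the $q=\log_2 d$ bandwidths and maps a one-hot class vector to the retained channel count $k_i$. It assumes $d$ is a power of two, which the set $\{2,4,8,\cdots,d\}$ implies; the function names are hypothetical.

```python
def candidate_bandwidths(d):
    """Return [2, 4, ..., d] for a power-of-two channel dimension d >= 2."""
    if d < 2 or d & (d - 1):
        raise ValueError("d is assumed to be a power of two")
    q = d.bit_length() - 1  # q = log2(d)
    return [2 ** j for j in range(1, q + 1)]

def retained_channels(one_hot, d):
    """Map a one-hot class vector of length q to the retained channel count k_i."""
    return candidate_bandwidths(d)[one_hot.index(1)]

candidates = candidate_bandwidths(16)         # four classes for d = 16
k_i = retained_channels([0, 0, 1, 0], d=16)   # third class selects 8 channels
```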

As illustrated in Fig. 5, after every cross-attention Transformer block, a rate predictor is introduced to give a rate prediction for the current feature $\mathbf{Y}_v$. Consider that we have $L$ cross-attention Transformer blocks and $L$ rate predictors. The $l$-th ($l \in [1,L-1]$) rate predictor takes the current rate embedding $\mathbf{Y}_{\text{rate}}^{l}$ as input. First, the $l$-th rate predictor projects $\mathbf{Y}_{\text{rate}}^{l}$ using a linear layer to model its local information:

$$\mathbf{Z}^{l}_{\text{local}}=\operatorname{Linear}(\mathbf{Y}_{\text{rate}}^{l}) \in \mathbb{R}^{l_v \times d}, \tag{13}$$

where $\operatorname{Linear}(\cdot)$ denotes the linear layer with the $\operatorname{GeLU}(\cdot)$ activation function. We apply average pooling to $\mathbf{Z}^{l}_{\text{local}}$ along the token dimension to obtain the global information:

$$\mathbf{Z}^{l}_{\text{global}}=\operatorname{AveragePool}(\mathbf{Z}^{l}_{\text{local}}) \in \mathbb{R}^{1 \times d}, \tag{14}$$

where $\operatorname{AveragePool}(\cdot)$ denotes the average pooling operation. After that, the rate predictor combines the local and global information along the channel dimension:

$$\mathbf{Z}_{\text{rate}}^{l}=\operatorname{concat}(\mathbf{Z}^{l}_{\text{local}},\mathbf{Z}^{l}_{\text{global}}) \in \mathbb{R}^{l_v \times 2d}, \tag{15}$$

where $\operatorname{concat}(\cdot)$ denotes the concatenation along the channel dimension. Then, $\mathbf{Z}_{\text{rate}}^{l}$ is fed into a multilayer perceptron (MLP) with softmax to obtain the $l$-th decision score $\mathbf{D}_l \in \mathbb{R}^{l_v \times q}$:

$$\mathbf{D}_l=\operatorname{Softmax}(\operatorname{MLP}(\mathbf{Z}_{\text{rate}}^{l})), \tag{16}$$

where $\operatorname{MLP}(\cdot)$ is an MLP of stacked linear layers with the $\operatorname{GeLU}(\cdot)$ activation function, and $\operatorname{Softmax}(\cdot)$ denotes the softmax operation.
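A shape-level sketch of one rate predictor, following Eqs. (13) to (16), is given below. It assumes bias-free layers, a two-layer MLP with hidden width $d$, random weights, and that the $1 \times d$ global vector is repeated for every token before concatenation (the $l_v \times 2d$ shape in Eq. (15) suggests this). None of these choices or names come from the paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(x):
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax_rows(a):
    """Row-wise softmax with a max-shift for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    """Random weights, used only to make the sketch runnable."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rate_predictor(y_rate, w_local, w_hidden, w_out):
    """Return (D_l, Z_rate) for one rate predictor."""
    # Eq. (13): per-token linear layer with GELU, local information, l_v x d.
    z_local = [[gelu(z) for z in row] for row in matmul(y_rate, w_local)]
    # Eq. (14): average over the token dimension, global information, 1 x d.
    n_tokens = len(z_local)
    z_global = [sum(col) / n_tokens for col in zip(*z_local)]
    # Eq. (15): repeat the global vector per token and concatenate, l_v x 2d.
    z_rate = [row + z_global for row in z_local]
    # Eq. (16): MLP followed by softmax over the q candidate bandwidths, l_v x q.
    hidden = [[gelu(h) for h in row] for row in matmul(z_rate, w_hidden)]
    return softmax_rows(matmul(hidden, w_out)), z_rate

rng = random.Random(0)
l_v, d, q = 3, 8, 3  # q = log2(d)
y_rate_l = rand_matrix(l_v, d, rng)
w_local = rand_matrix(d, d, rng)
w_hidden = rand_matrix(2 * d, d, rng)
w_out = rand_matrix(d, q, rng)
D_l, Z_rate = rate_predictor(y_rate_l, w_local, w_hidden, w_out)
```

Each row of `D_l` is a probability vector over the $q$ candidate bandwidths for one token.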

For the last rate predictor, all previous $\mathbf{D}_l$ ($l \in [1,L-1]$) are used as additional inputs to help this rate predictor make the final decision $\mathbf{D} \in \mathbb{R}^{l_v \times q}$. The final prediction process can be formulated as:

$$\mathbf{D}=\operatorname{Softmax}(\operatorname{MLP}(\operatorname{concat}(\mathbf{Z}_{\text{rate}}^{L},\beta(\mathbf{D}_1 \dots \mathbf{D}_{L-1})))), \tag{17}$$

where $\beta(\cdot)$ is the function that aggregates the previous decisions. $\beta(\cdot)$ can be attention-based aggregation or another aggregation method. For simplicity, we use the average operation to implement $\beta(\cdot)$:

$$\beta(\mathbf{D}_1 \dots \mathbf{D}_{L-1})=\frac{1}{L}\sum_{i=1}^{L-1}\mathbf{D}_i. \tag{18}$$
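The aggregation in Eq. (18) and the input of the final MLP in Eq. (17) can be sketched as follows, with hypothetical function names and toy values. The code follows the equation as printed: the sum runs over $L-1$ decision scores and is divided by $L$, so each aggregated row sums to $(L-1)/L$ rather than 1.

```python
def aggregate_decisions(decisions, num_blocks):
    """beta(D_1..D_{L-1}): element-wise sum over L-1 scores divided by L, as in Eq. (18)."""
    rows, cols = len(decisions[0]), len(decisions[0][0])
    return [[sum(d_l[i][j] for d_l in decisions) / num_blocks for j in range(cols)]
            for i in range(rows)]

def final_predictor_input(z_rate_last, decisions, num_blocks):
    """concat(Z_rate^L, beta(...)) along the channel dimension, the MLP input in Eq. (17)."""
    aggregated = aggregate_decisions(decisions, num_blocks)
    return [z_row + a_row for z_row, a_row in zip(z_rate_last, aggregated)]

# Toy example: L = 3 blocks, so two earlier decision scores, l_v = 2 tokens, q = 3 classes.
D_1 = [[0.2, 0.3, 0.5], [1.0, 0.0, 0.0]]
D_2 = [[0.4, 0.3, 0.3], [0.0, 1.0, 0.0]]
beta = aggregate_decisions([D_1, D_2], num_blocks=3)
z_rate_L = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # stands in for an l_v x 2d tensor
mlp_input = final_predictor_input(z_rate_L, [D_1, D_2], num_blocks=3)
```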

Here, $\mathbf{D}$ represents the probability that each $\mathbf{Y}_{v,i}$ is classified into each of the $q$ fixed bandwidths. We then introduce the Gumbel-Softmax trick, which is often used in network pruning, to solve the non-differentiable sampling problem.

Given the $d$-dimensional token $\mathbf{Y}_{v,i}$ ($i \in [1,l_v]$), we want to draw the sample $\mathbf{P}_i^{\text{hard}} \in \mathbb{R}^{q}$ representing the chosen bandwidth from a categorical distribution with the class probability $\mathbf{D}_i \in \mathbb{R}^{q}$. First, the Gumbel-Max trick formulates the sampling process as:

$$\mathbf{P}_i^{\text{hard}}=\operatorname{onehot}\left(\mathop{\arg\max}\limits_{j}\left(\log(\mathbf{D}_{i,j})+\mathbf{G}_{i,j}\right)\right),\; j \in [1,q], \tag{19}$$

where all elements of $\mathbf{G} \in \mathbb{R}^{l_v \times q}$ follow the Gumbel distribution $\operatorname{Gumbel}(0,1)$ and $\operatorname{onehot}(\cdot)$ is the one-hot encoding function. $\mathbf{G}_{i,j}$ can be computed by:

$$\mathbf{G}_{i,j}=-\log(-\log(\mathbf{U}_{i,j})), \tag{20}$$

where $\mathbf{U} \in \mathbb{R}^{l_v \times q}$ consists of i.i.d. samples drawn from $\operatorname{Uniform}(0,1)$. Then, the softmax function with temperature coefficient $\tau$ is used as a continuous, differentiable approximation to $\arg\max(\cdot)$, obtaining the soft version $\mathbf{P}_i^{\text{soft}} \in \mathbb{R}^{q}$ of $\mathbf{P}_i^{\text{hard}}$:

$$\mathbf{P}_{i,j}^{\text{soft}}=\frac{e^{(\log(\mathbf{D}_{i,j})+\mathbf{G}_{i,j})/\tau}}{\sum_{k=1}^{q}e^{(\log(\mathbf{D}_{i,k})+\mathbf{G}_{i,k})/\tau}},\quad j\in[1,q]. \tag{21}$$

With the Gumbel-Softmax trick, $\mathbf{P}_{i}^{\text{hard}}$ is used in the forward propagation of the network, whereas during backpropagation the parameters are updated using the gradient of $\mathbf{P}_{i}^{\text{soft}}$. As the temperature coefficient $\tau$ decreases, the soft version $\mathbf{P}_{i}^{\text{soft}}$ approaches the hard version $\mathbf{P}_{i}^{\text{hard}}$, which gradually aligns the forward and backward propagation processes of the network. However, a small $\tau$ can make training unstable. Therefore, we choose a large temperature coefficient $\tau$ at the beginning and gradually decay it during training. We use $\mathbf{P}_{i}^{\text{hard}}$ to select the bandwidth for each token $\mathbf{Y}_{v,i}$, i.e., the number of retained channels.
Note that for every $\mathbf{Y}_{v}\in\mathbb{R}^{l_{v}\times d}$, a corresponding $\mathbf{b}\in\mathbb{R}^{l_{v}}$ needs to be transmitted through the lossless link to inform the receiver of the number of retained channels for each token. $\mathbf{S}_{v}$ is generated by setting part of the channels in $\mathbf{Y}_{v}$ to zero according to $\mathbf{M}$.
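The sampling in Eqs. (19)-(21) can be illustrated with a minimal, standard-library-only sketch. The function names, the exponential decay schedule and its constants (`tau_start`, `tau_end`, `decay`) are assumptions made for illustration; the text only states that $\tau$ starts large and is gradually decayed.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via Eq. (20): G = -log(-log(U))."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_row(probs, tau, rng):
    """Return (hard, soft) for one token's decision row D_i (Eqs. 19 and 21)."""
    logits = [math.log(p) + sample_gumbel(rng) for p in probs]
    # Hard version: one-hot of the argmax, used in the forward pass.
    k = max(range(len(logits)), key=lambda j: logits[j])
    hard = [1.0 if j == k else 0.0 for j in range(len(logits))]
    # Soft version: temperature softmax, whose gradient is used in the backward pass.
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    soft = [e / total for e in exps]
    return hard, soft


def decayed_tau(step, tau_start=5.0, tau_end=0.5, decay=0.99):
    """Hypothetical schedule: start with a large tau and decay it toward tau_end."""
    return max(tau_end, tau_start * decay ** step)
```

In an automatic-differentiation framework, the forward/backward split described above is commonly realized with a straight-through estimator, i.e., the hard sample is emitted while gradients flow through the soft one. The pure-Python sketch only shows the two quantities side by side.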

At the receiver, we first generate $\mathbf{M}$ from the received $\mathbf{b}$. Then, we unflatten and zero-pad the noisy real-valued channel output symbols $\hat{\mathbf{s}}_{v}$ according to $\mathbf{M}$ to obtain $\hat{\mathbf{S}}_{v}$. A learnable vector $\mathbf{c}\in\mathbb{R}^{d}$ is introduced to compensate for the information lost through the channel masking operation. For each token $\hat{\mathbf{S}}_{v,i}\in\mathbb{R}^{d}$, if its $j$-th channel is masked, $\mathbf{c}_{j}$ is used as the initial value for this channel:

$$\hat{\mathbf{S}}_{v,i}+(\mathbf{J}_{i}-\mathbf{M}_{i})\odot\mathbf{c}\to\hat{\mathbf{S}}_{v,i}, \tag{22}$$

where $\mathbf{J}\in\mathbb{R}^{l_{v}\times d}$ denotes the all-ones matrix. Then, $\hat{\mathbf{S}}_{v}$ and $\mathbf{Y}_{\text{rate}}$ are fed into the JSC decoder $f_{d}(\cdot;\boldsymbol{\phi})$ to progressively decode the video semantics $\hat{\mathbf{Y}}_{v}$.
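The masking at the transmitter and the zero-padding plus compensation of Eq. (22) at the receiver can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that token $i$ keeps its first $\mathbf{b}_{i}$ channels (the text only specifies how many channels are retained), and all function names are hypothetical.

```python
def build_mask(b, d):
    """Binary mask M (l_v x d): assume token i keeps its first b[i] channels."""
    return [[1 if j < b_i else 0 for j in range(d)] for b_i in b]


def flatten_retained(y, mask):
    """Transmitter side: keep only the unmasked entries, in row-major order."""
    return [y_ij
            for y_row, m_row in zip(y, mask)
            for y_ij, m_ij in zip(y_row, m_row) if m_ij]


def unflatten_and_compensate(s_hat_flat, mask, c):
    """Receiver side: zero-pad according to M, then apply Eq. (22)."""
    symbols = iter(s_hat_flat)
    out = []
    for m_row in mask:
        # Retained channels take the next received symbol; masked ones are zero.
        padded = [next(symbols) if m_ij else 0.0 for m_ij in m_row]
        # (J_i - M_i) * c adds c_j only where channel j was masked.
        out.append([s + (1 - m_ij) * c_j
                    for s, m_ij, c_j in zip(padded, m_row, c)])
    return out
```

For example, with `b = [2, 0, 3]` and `d = 3`, the second token transmits nothing and is reconstructed entirely from `c`, while the third token is received in full and left unchanged.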

2.4 Content-Adaptive and SNR-Adaptive

The bandwidth allocation method described in Section 2.3 makes VideoQA-SC bandwidth-efficient. The decision $\mathbf{D}$ is determined only by a series of versions of $\mathbf{Y}_{\text{rate}}$ ($\mathbf{Y}_{\text{rate}}^{1},\mathbf{Y}_{\text{rate}}^{2},\dots,\mathbf{Y}_{\text{rate}}^{L}$) and a series of versions of $\mathbf{Y}_{v}$ ($\mathbf{Y}_{v}^{1},\mathbf{Y}_{v}^{2},\dots,\mathbf{Y}_{v}^{L-1}$), which means that the bandwidth allocation adapts to the video semantics, i.e., the video content. Such rate predictors make the same bandwidth allocation under different channel conditions, which is inconsistent with conventional channel coding practice. Because content-adaptive bandwidth allocation alone is not robust to noise, it struggles to support SC under diverse channel conditions.

Assume that the transmitter obtains the exact SNR through ideal channel estimation. With the channel SNR as an input, the rate predictors can combine the video content with the current channel condition, producing a bandwidth allocation that adapts to both the video content and the SNR. Specifically, we feed the SNR as an additional input to all rate predictors, so that the decisions adapt to the current channel condition at each layer of the JSC encoder.

For SNR-adaptive bandwidth allocation, $\mathbf{D}_{l}$ is computed by:

$$\mathbf{D}_{l}=\operatorname{MLP}(\mathbf{Z}_{\text{rate}}^{l},\text{SNR}), \tag{23}$$

and $\mathbf{D}$ is computed by:

$$\mathbf{D}=\operatorname{MLP}(\operatorname{concat}(\mathbf{Z}_{\text{rate}}^{L},\beta(\mathbf{D}_{1}\dots\mathbf{D}_{L-1}),\text{SNR})). \tag{24}$$

In this way, because the SNR is fed to every rate predictor, the network is forced to learn dynamic bandwidth allocation strategies based on channel conditions, which makes VideoQA-SC robust to noise.
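To make the role of the SNR input concrete, the toy sketch below replaces the MLP of Eq. (23) with a single linear layer applied to each token feature with the SNR appended. This is an assumption-laden illustration only: the layer shape, the row-wise softmax output, and the names `rate_decision`, `weights` and `bias` are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rate_decision(z_rate, snr_db, weights, bias):
    """Toy stand-in for Eq. (23): one linear layer over [token feature, SNR]."""
    decisions = []
    for token in z_rate:
        x = list(token) + [snr_db]  # the SNR is appended as an extra input feature
        logits = [sum(w * v for w, v in zip(w_row, x)) + b_k
                  for w_row, b_k in zip(weights, bias)]
        decisions.append(softmax(logits))  # one row of q probabilities per token
    return decisions
```

With a nonzero weight on the SNR feature, the same token features lead to different decision rows at different SNRs, which is the behavior that a purely content-adaptive predictor cannot provide.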

2.5 Multimodal Fuser

The multimodal fuser $u(\cdot;\boldsymbol{\nu})$ lets the video and text information interact and identifies the video content that is informative with respect to the question for answer prediction.

Consider a particular question-answer (QA) pair, such as the question $q(\textit{``what does the butterfly do 10 or more than 10 times''})$ and the candidate answers $a_{0}(\textit{``stuff marshmallow''})$, $a_{1}(\textit{``fall over''})$ and $a_{2}(\textit{``flap wings''})$. The text information can be organized as a tuple of three sequences $(q\,[\operatorname{SEP}]\,a_{0},\ q\,[\operatorname{SEP}]\,a_{1},\ q\,[\operatorname{SEP}]\,a_{2})$, i.e., each candidate answer is paired with the question to form a language sequence. $[\operatorname{SEP}]$ is a special token that separates the text of the question from that of the answer. Then, each sequence is converted into tokens by the tokenizer.
We use a language model to capture the correlations between the tokens in each sequence and extract the candidate QA-pair feature $\mathbf{Y}_{q}\in\mathbb{R}^{b\times s_{i}\times d}$ from $\mathbf{X}_{q}$, where $b$ is the number of candidate answers and $s_{i}$ is the length of the $i$-th sequence.
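The pairing of the question with each candidate answer can be written as a short sketch. The helper name and the whitespace placed around the separator are assumptions; an actual tokenizer would typically insert its own separator token.

```python
def build_qa_sequences(question, candidates, sep="[SEP]"):
    """Pair the question with every candidate answer: 'q [SEP] a_i'."""
    return [f"{question} {sep} {answer}" for answer in candidates]


sequences = build_qa_sequences(
    "what does the butterfly do 10 or more than 10 times",
    ["stuff marshmallow", "fall over", "flap wings"],
)
```

Each of the resulting $b$ strings is then tokenized separately, so the sequence lengths $s_{i}$ may differ across candidates.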

Given $\hat{\mathbf{Y}}_{v}$ and $\mathbf{Y}_{q}$, the two modalities interact through attention-based weighted fusion. After linear projection, $\hat{\mathbf{Y}}_{v}$ is mapped to $\mathbf{E}_{v}\in\mathbb{R}^{l_{v}\times d}$ as the query and $\mathbf{Y}_{q}$ is mapped to $\mathbf{E}_{q}\in\mathbb{R}^{b\times s_{i}\times d}$ as the key. Then, we calculate the attention of $\hat{\mathbf{Y}}_{v}$ to each candidate QA-pair feature $\mathbf{Y}_{q,i}$ and use attention-based fusion of $\hat{\mathbf{Y}}_{v}$ and $\mathbf{Y}_{q}$ to obtain the QA-aware video feature $\mathbf{Y}_{qv}\in\mathbb{R}^{l_{v}\times d}$, which can be formulated as:

$$\begin{aligned}&\boldsymbol{\gamma}_{i}=\operatorname{softmax}(\mathbf{E}_{v}(\mathbf{E}_{q,i})^{T}),\\&\mathbf{Y}_{qv}=\hat{\mathbf{Y}}_{v}+\sum_{i=1}^{b}\boldsymbol{\gamma}_{i}\mathbf{Y}_{q,i}.\end{aligned} \tag{25}$$

We add Transformer blocks to refine the QA-aware feature $\mathbf{Y}_{qv}$ and average-pool it along the token dimension to obtain the global QA-aware video feature $\mathbf{Y}_{qv}^{\text{global}}\in\mathbb{R}^{1\times d}$. Similarly, the candidate QA-pair feature $\mathbf{Y}_{q}$ is average-pooled along the sequence length dimension to obtain the global text feature $\mathbf{Y}_{q}^{\text{global}}\in\mathbb{R}^{b\times d}$. We simply apply the dot product followed by softmax to obtain the answer score $\mathbf{a}_{\text{pred}}\in\mathbb{R}^{1\times b}$, which measures the similarity between $\mathbf{Y}_{q}^{\text{global}}$ and $\mathbf{Y}_{qv}^{\text{global}}$:

$$\mathbf{a}_{\text{pred}}=\operatorname{softmax}(\mathbf{Y}_{qv}^{\text{global}}(\mathbf{Y}_{q}^{\text{global}})^{T}). \tag{26}$$

Finally, the answer with the highest prediction score is output as the predicted answer $a^{\star}$ by the multimodal fuser $u(\cdot;\boldsymbol{\nu})$:

$$a^{\star}=\mathop{\arg\max}\limits_{i}a_{\text{pred},i}, \tag{27}$$

where $a_{\text{pred},i}$ is the $i$-th element of $\mathbf{a}_{\text{pred}}$.
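Eqs. (25)-(27) can be traced on small nested lists with the sketch below. It is an illustration under stated assumptions: the softmax in Eq. (25) is taken over the $s_{i}$ text tokens, the linear projections are assumed to have been applied already (so $\mathbf{E}_{v}$ and $\mathbf{E}_{q}$ are inputs), and the Transformer refinement step is omitted. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def mean_rows(a):
    """Average-pool an (n x d) matrix over its rows, giving a length-d vector."""
    return [sum(col) / len(a) for col in zip(*a)]


def fuse(y_v_hat, e_v, y_q, e_q):
    """Eq. (25): add attention-weighted QA-pair features to the video feature."""
    fused = [row[:] for row in y_v_hat]
    for y_qi, e_qi in zip(y_q, e_q):
        scores = matmul(e_v, transpose(e_qi))      # shape (l_v x s_i)
        gamma = [softmax(row) for row in scores]   # assumed: softmax over text tokens
        update = matmul(gamma, y_qi)               # shape (l_v x d)
        fused = [[f + u for f, u in zip(f_row, u_row)]
                 for f_row, u_row in zip(fused, update)]
    return fused


def answer_scores(y_qv, y_q):
    """Eqs. (26)-(27): pooled dot-product similarity, softmax, then argmax."""
    g_v = mean_rows(y_qv)
    g_q = [mean_rows(y_qi) for y_qi in y_q]
    a_pred = softmax([sum(p * q for p, q in zip(g_v, g)) for g in g_q])
    best = max(range(len(a_pred)), key=lambda i: a_pred[i])
    return a_pred, best
```

Because each candidate keeps its own length $s_{i}$, the candidates are stored as a list of matrices rather than a single rectangular tensor.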

Algorithm 1 Forward process of VideoQA-SC over the AWGN channel.
Input: The chosen mini-batch data $\{(\mathbf{X}_{v},\mathbf{X}_{q})\}_{i}^{i+bz}$; the training channel SNR $\text{SNR}_{\text{train},i}$; and the model parameters $\boldsymbol{\zeta}$, $\boldsymbol{\theta}$, $\boldsymbol{\epsilon}$, $\boldsymbol{\phi}$, $\boldsymbol{\nu}$.
Output: The answer score $\{\mathbf{a}_{\text{pred}}\}_{i}^{i+bz}$ and the binary mask matrix $\mathbf{M}$.
1: $g(\{\mathbf{X}_{v}\}_{i}^{i+bz};\boldsymbol{\zeta})\to\{\mathbf{Y}_{v}\}_{i}^{i+bz}$.
2: if $\text{SNR}_{\text{train},i}=\text{None}$ then
3:     $f_{e}(\{\mathbf{Y}_{v}\}_{i}^{i+bz};\boldsymbol{\theta})\to\{\mathbf{S}_{v}\}_{i}^{i+bz}$ and generate $\{\mathbf{M}\}_{i}^{i+bz}$.
4: else
5:     $f_{e}(\{\mathbf{Y}_{v}\}_{i}^{i+bz},\text{SNR}_{\text{train},i};\boldsymbol{\theta},\boldsymbol{\epsilon})\to\{\mathbf{S}_{v}\}_{i}^{i+bz}$ and generate $\{\mathbf{M}\}_{i}^{i+bz}$.
6: end if
7: Flatten and real-to-complex: $\{\mathbf{S}_{v}\}_{i}^{i+bz}\to\{\mathbf{s}_{v}^{c}\}_{i}^{i+bz}$.
8: Compute the average noise power $\sigma^{2}$ from $\text{SNR}_{\text{train},i}$ and sample $\mathbf{n}$ from $\mathcal{CN}(0,\sigma^{2}\mathbf{I})$ $bz$ times.
9: $\{\hat{\mathbf{s}}_{v}^{c}\}_{i}^{i+bz}=\{\mathbf{s}_{v}^{c}\}_{i}^{i+bz}+\{\mathbf{n}\}_{i}^{i+bz}$.
10: Complex-to-real and unflatten: $\{\hat{\mathbf{s}}_{v}^{c}\}_{i}^{i+bz}\to\{\hat{\mathbf{S}}_{v}\}_{i}^{i+bz}$.
11: $f_{d}(\{\hat{\mathbf{S}}_{v}\}_{i}^{i+bz};\boldsymbol{\phi})\to\{\hat{\mathbf{Y}}_{v}\}_{i}^{i+bz}$.
12: $u(\{\hat{\mathbf{Y}}_{v}\}_{i}^{i+bz},\{\mathbf{X}_{q}\}_{i}^{i+bz};\boldsymbol{\nu})\to\{\mathbf{a}_{\text{pred}}\}_{i}^{i+bz}$.
13: return $\{\mathbf{a}_{\text{pred}}\}_{i}^{i+bz}$, $\{\mathbf{M}\}_{i}^{i+bz}$.
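The channel steps of Algorithm 1 (real-to-complex conversion, noise power computation, and noise addition) can be sketched with the standard library alone. Two details are assumptions rather than statements from the paper: consecutive real symbols are paired into one complex symbol, and the noise power is derived from the measured average signal power instead of from an explicit power normalization.

```python
import math
import random


def real_to_complex(s):
    """Pair consecutive real symbols into complex symbols (assumes even length)."""
    return [complex(s[k], s[k + 1]) for k in range(0, len(s), 2)]


def complex_to_real(sc):
    """Inverse of real_to_complex: interleave real and imaginary parts."""
    out = []
    for z in sc:
        out.extend([z.real, z.imag])
    return out


def noise_power(symbols, snr_db):
    """sigma^2 = P / 10^(SNR/10), with P the average power per complex symbol."""
    p = sum(abs(z) ** 2 for z in symbols) / len(symbols)
    return p / (10.0 ** (snr_db / 10.0))


def awgn(symbols, snr_db, rng):
    """Add circularly symmetric complex Gaussian noise CN(0, sigma^2)."""
    sigma2 = noise_power(symbols, snr_db)
    # The variance is split evenly between the real and imaginary parts.
    std = math.sqrt(sigma2 / 2.0)
    return [z + complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for z in symbols]
```

A call such as `complex_to_real(awgn(real_to_complex(s), 10.0, random.Random(0)))` then mirrors the flatten, transmit and unflatten sequence for one sample at 10 dB.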
Algorithm 2 Training of the SNR-adaptive VideoQA-SC.
Input: The training dataset $\mathcal{D}_{\text{train}}$; the batch size $bz$; the trained VideoQA model parameters $\boldsymbol{\zeta}$ and $\boldsymbol{\nu}$; the trade-off hyperparameter $\lambda$.
Output: The trained parameters $\boldsymbol{\zeta}^{\star}$, $\boldsymbol{\theta}^{\star}$, $\boldsymbol{\epsilon}^{\star}$, $\boldsymbol{\phi}^{\star}$, $\boldsymbol{\nu}^{\star}$.
1: Initialization: DJSCC transmission model parameters $\boldsymbol{\theta}$, $\boldsymbol{\epsilon}$, $\boldsymbol{\phi}$; training SNR range $(\text{SNR}_{\text{start}},\text{SNR}_{\text{end}})$.
2: Freeze $\boldsymbol{\zeta}$, $\boldsymbol{\nu}$ and $\boldsymbol{\epsilon}$. $\triangleright$ fixed-bandwidth DJSCC training
3: Choose mini-batch data $\{(\mathbf{X}_{v},\mathbf{X}_{q},\mathbf{a}_{\text{label}})\}_{i}^{i+bz}$.
4: Sample $\text{SNR}_{\text{train},i}$ from $\operatorname{Uniform}(\text{SNR}_{\text{start}},\text{SNR}_{\text{end}})$.
5: Compute $\{\mathbf{a}_{\text{pred}}\}_{i}^{i+bz}$ by Algorithm 1 without input $\text{SNR}_{\text{train},i}$.
6: Compute $\mathcal{L}_{\text{stage2}}$ by Eq. (29) and optimize $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ by gradient descent.
7: Repeat lines 3-6 to complete the training of stage 2.
8: Unfreeze $\boldsymbol{\epsilon}$. $\triangleright$ bandwidth-adaptive DJSCC training
9: Repeat lines 3-4.
10: Compute $\{\mathbf{a}_{\text{pred}}\}_{i}^{i+bz}$ and $\{\mathbf{M}\}_{i}^{i+bz}$ by Algorithm 1 with input $\text{SNR}_{\text{train},i}$.
11: Compute $\mathcal{L}_{\text{stage3}}$ by Eq. (30) and optimize $\boldsymbol{\theta}$, $\boldsymbol{\phi}$ and $\boldsymbol{\epsilon}$ by gradient descent.
12: Repeat lines 9-11 to complete the training of stage 3.
13: Unfreeze $\boldsymbol{\zeta}$ and $\boldsymbol{\nu}$. $\triangleright$ VideoQA-SC finetuning
14: Repeat lines 9-10.
15: Compute $\mathcal{L}_{\text{stage4}}$ by Eq. (31) and optimize all parameters by gradient descent.
16: Repeat lines 14-15 to complete the training of stage 4.
17: return Optimized $\boldsymbol{\zeta}^{\star}$, $\boldsymbol{\theta}^{\star}$, $\boldsymbol{\epsilon}^{\star}$, $\boldsymbol{\phi}^{\star}$, $\boldsymbol{\nu}^{\star}$.
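The freeze and unfreeze schedule of Algorithm 2, together with the preceding VideoQA model training stage, can be summarized as a small lookup table. The sketch only records which parameter groups receive gradient updates in each stage; the string names and helper functions are hypothetical placeholders for the actual parameter sets.

```python
# Parameter groups: zeta (semantic encoder), theta (JSC encoder),
# epsilon (rate predictors), phi (JSC decoder), nu (multimodal fuser).
PARAMS = ("zeta", "theta", "epsilon", "phi", "nu")

TRAINABLE = {
    1: {"zeta", "nu"},                 # VideoQA model training, no channel
    2: {"theta", "phi"},               # fixed-bandwidth DJSCC training
    3: {"theta", "phi", "epsilon"},    # bandwidth-adaptive DJSCC training
    4: set(PARAMS),                    # VideoQA-SC finetuning
}


def frozen(stage):
    """Parameter groups that are not updated in the given stage."""
    return set(PARAMS) - TRAINABLE[stage]


def uses_snr_input(stage):
    """Algorithm 1 is called with the SNR as an encoder input from stage 3 on."""
    return stage >= 3
```

For instance, `frozen(2)` returns the VideoQA parameters and the rate-predictor parameters, matching the freeze step that opens the fixed-bandwidth stage.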

2.6 Training Strategy

VideoQA-SC can be viewed as a combination of the VideoQA model ($g(\cdot;\boldsymbol{\zeta})$, $u(\cdot;\boldsymbol{\nu})$) and the DJSCC transmission model ($f_{e}(\cdot;\boldsymbol{\theta},\boldsymbol{\epsilon})$, $f_{d}(\cdot;\boldsymbol{\phi})$), where the VideoQA model is responsible for accurate execution of VideoQA tasks and the DJSCC model is responsible for reliable and efficient transmission of video semantics. Instead of training the whole system E2E from scratch, we adopt a progressive training strategy to keep the training process stable. The progressive training strategy consists of four stages:

2.6.1 VideoQA Model Training

Train the VideoQA model, $g(\cdot;\boldsymbol{\zeta})$ and $u(\cdot;\boldsymbol{\nu})$, without considering the noisy channel. The video semantics extracted by the video semantic encoder $g(\cdot;\boldsymbol{\zeta})$ are fed losslessly into the multimodal fuser $u(\cdot;\boldsymbol{\nu})$ to predict the answer. The cross-entropy between the prediction of the VideoQA model and the one-hot label is used as the task loss $\mathcal{L}_{\text{task}}$ in this stage:

$$\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{task}} = \operatorname{CE}(\mathbf{a}_{\text{pred}}, \mathbf{a}_{\text{label}}),$$ (28)

where $\operatorname{CE}(\cdot,\cdot)$ is the cross-entropy loss.
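As a minimal sketch of Eq. (28), the snippet below computes the cross-entropy between a softmax prediction over the candidate answers and a one-hot label. The function names and the example scores are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred_probs, one_hot_label, eps=1e-12):
    """CE(a_pred, a_label) = -sum_k a_label[k] * log(a_pred[k])."""
    return -sum(y * math.log(p + eps) for p, y in zip(pred_probs, one_hot_label))


# Hypothetical scores for five candidate answers; the third one is correct.
logits = [0.2, -1.0, 2.5, 0.1, -0.3]
a_label = [0, 0, 1, 0, 0]
a_pred = softmax(logits)
loss_task = cross_entropy(a_pred, a_label)
```

With a one-hot label, the sum reduces to the negative log-probability assigned to the correct answer, so the loss falls as the model concentrates probability on that answer.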

2.6.2 Fixed-Bandwidth DJSCC Transmission Training

Train the fixed-bandwidth DJSCC transmission model ($f_{e}(\cdot;\boldsymbol{\theta},\boldsymbol{\epsilon})$ and $f_{d}(\cdot;\boldsymbol{\phi})$) without the rate-predictor parameters $\boldsymbol{\epsilon}$, while freezing the VideoQA model parameters $\boldsymbol{\zeta}$ and $\boldsymbol{\nu}$. Taking the noisy channel into account, the training objective of the fixed-bandwidth DJSCC transmission model is to maximize the task performance. This stage uses the same loss as $\mathcal{L}_{\text{stage1}}$:

$$\mathcal{L}_{\text{stage2}} = \mathcal{L}_{\text{task}} = \operatorname{CE}(\mathbf{a}_{\text{pred}}, \mathbf{a}_{\text{label}}).$$ (29)

2.6.3 Bandwidth-Adaptive DJSCC Transmission Training

Train the bandwidth-adaptive DJSCC transmission model ($f_{e}(\cdot;\boldsymbol{\theta},\boldsymbol{\epsilon})$ and $f_{d}(\cdot;\boldsymbol{\phi})$), including the rate-predictor parameters $\boldsymbol{\epsilon}$. The training objective of this stage is a trade-off between the task performance and the bandwidth cost. Therefore, the bandwidth cost loss $\mathcal{L}_{\text{rate}}$, weighted by the hyperparameter $\lambda$, is introduced in this stage:

$$\mathcal{L}_{\text{stage3}} = \mathcal{L}_{\text{task}} + \lambda\mathcal{L}_{\text{rate}} = \operatorname{CE}(\mathbf{a}_{\text{pred}}, \mathbf{a}_{\text{label}}) + \lambda\sum_{i}\sum_{j}\mathbf{M}_{i,j},$$ (30)

where $\mathbf{M}_{i,j}$ denotes the element in the $i$-th row and $j$-th column of the binary mask matrix $\mathbf{M}$, and $\lambda$ controls the trade-off between the task performance and the bandwidth cost.
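The stage-3 objective can be illustrated with a small sketch that follows the definition $\mathcal{L}_{\text{task}} + \lambda\mathcal{L}_{\text{rate}}$, with the rate term taken as the sum of the binary mask entries. The mask size, the task-loss value and the two values of $\lambda$ are hypothetical and chosen only to show how $\lambda$ shifts the balance.

```python
def rate_loss(mask):
    """L_rate: number of retained channels, i.e. the sum of all mask entries."""
    return sum(sum(row) for row in mask)


def stage3_loss(task_loss, mask, lam):
    """L_stage3 = L_task + lambda * L_rate."""
    return task_loss + lam * rate_loss(mask)


# Hypothetical mask: 3 tokens, up to 8 channels each (1 = transmitted).
M = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
loss_small_lambda = stage3_loss(task_loss=0.9, mask=M, lam=0.001)
loss_large_lambda = stage3_loss(task_loss=0.9, mask=M, lam=0.1)
```

In this toy case the rate term dominates the total once $\lambda$ is large, which is the mechanism that pushes the rate predictors toward fewer retained channels.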

2.6.4 E2E Finetuning

Unfreeze the VideoQA model parameters $\boldsymbol{\zeta}$ and $\boldsymbol{\nu}$, and jointly finetune all system parameters $\boldsymbol{\zeta}$, $\boldsymbol{\theta}$, $\boldsymbol{\epsilon}$, $\boldsymbol{\phi}$ and $\boldsymbol{\nu}$ to improve the E2E system performance. This stage uses the same loss as $\mathcal{L}_{\text{stage3}}$:

$$\mathcal{L}_{\text{stage4}} = \mathcal{L}_{\text{task}} + \lambda\mathcal{L}_{\text{rate}} = \operatorname{CE}(\mathbf{a}_{\text{pred}}, \mathbf{a}_{\text{label}}) + \lambda\sum_{i}\sum_{j}\mathbf{M}_{i,j}.$$ (31)

Algorithm 1 presents the forward process of VideoQA-SC over the AWGN channel. Furthermore, the training process for SNR-adaptive VideoQA-SC, which incorporates SNR information into bandwidth allocation, is shown in Algorithm 2.
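The four-stage schedule described in this subsection can be summarized as a small lookup table. This is only a bookkeeping sketch: the group names mirror the parameter symbols in the text, while the dictionary layout and the helper function are assumptions made for illustration.

```python
# Parameter groups named after the symbols used in the text.
ALL_GROUPS = ("zeta", "nu", "theta", "phi", "epsilon")

# Which groups receive gradient updates in each stage, and which loss is used.
STAGES = {
    1: {"trainable": {"zeta", "nu"}, "loss": "task"},
    2: {"trainable": {"theta", "phi"}, "loss": "task"},
    3: {"trainable": {"theta", "phi", "epsilon"}, "loss": "task + lambda * rate"},
    4: {"trainable": set(ALL_GROUPS), "loss": "task + lambda * rate"},
}


def not_updated(stage):
    """Return the parameter groups that receive no updates in the given stage."""
    return sorted(set(ALL_GROUPS) - STAGES[stage]["trainable"])


for stage in sorted(STAGES):
    print(stage, sorted(STAGES[stage]["trainable"]), not_updated(stage))
```

In a deep learning framework, the same table would typically drive which parameters have gradients enabled before each stage starts.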

3 Experiments

In this section, we introduce the experimental setup and provide quantitative experimental results to demonstrate the effectiveness of VideoQA-SC in performing VideoQA tasks. We compare VideoQA-SC with SC systems that adopt traditional SSCC and advanced DJSCC transmission schemes under various channel conditions and bandwidth constraints.

3.1 Experimental Setup

3.1.1 Datasets

We choose the TGIF-QA dataset as the benchmark for our experiments. As one of the popular datasets for VideoQA, the TGIF-QA dataset consists of 165,165 QA pairs chosen from 71,741 animated GIFs. To evaluate the spatiotemporal reasoning ability at the video level, the TGIF-QA dataset defines four unique task types, i.e., repetition count, repeating action, state transition and frame QA.

We select repeating action and state transition for experiments, which are the two most challenging tasks in the TGIF-QA dataset. The two tasks are defined as multiple-choice questions, and each question has 5 candidate answers. The questions for repeating action involve identifying the repeated action in a video. The questions for state transition involve identifying the state before or after a particular state, including facial expressions, actions, places and object properties. For convenience, we refer to the repeating action task and the state transition task in the TGIF-QA dataset as the TGIF-QA Action dataset and the TGIF-QA Transition dataset, respectively.

To overcome the serious language bias in the original TGIF-QA dataset, we use questions and answers from an enhanced version of TGIF-QA, i.e., TGIF-QA-R [34], to force reasoning based on both text and video content. TGIF-QA-R has 20,475 QA pairs and 2,274 QA pairs as the training and testing datasets for the repeating action task, respectively. It has 52,704 QA pairs and 6,232 QA pairs as the training and testing datasets for the state transition task, respectively.

3.1.2 Implementation Details

In our experiments, $l_{v}$, $l_{c}$ and $l_{f}$ are set to 16, 4 and 4, respectively, for sparse sampling. $r$ and $m$ are set to 10 and 2048 for the spatiotemporal semantic encoder. $d$ is set to 256 for DJSCC transmission, which also represents the maximum bandwidth that can be allocated to each token. Both the JSC encoder and the JSC decoder consist of $L=4$ Transformer blocks for JSC encoding/decoding. There are also $L=4$ rate predictors for bandwidth allocation during JSC encoding. $q$ is set to 8 for adaptive bandwidth allocation, so the set of candidate retained channels for each token is $\mathcal{R}=\{2,4,8,16,32,64,128,256\}$. Each question has $b=5$ candidate answers to form QA pairs.
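To make the candidate set $\mathcal{R}$ concrete, the sketch below builds it from $q$ and turns per-token rate choices into rows of a binary mask. Keeping the first channels of each token is an assumed layout used only for illustration, and the example choices are hypothetical.

```python
d = 256  # maximum number of channels per token (from the text)
q = 8    # number of candidate rates (from the text)

# Candidate retained-channel counts: 2, 4, ..., 256.
R = [2 ** k for k in range(1, q + 1)]


def token_mask(rate, d=d):
    """Binary mask keeping the first `rate` of d channels (assumed layout)."""
    return [1] * rate + [0] * (d - rate)


# Hypothetical per-token decisions, given as indices into R.
choices = [1, 0, 7, 3]
M = [token_mask(R[c]) for c in choices]
avg_channels = sum(map(sum, M)) / len(M)
```

Here the four tokens keep 4, 2, 256 and 16 channels, so the average number of retained channels per token is 69.5 out of a maximum of 256.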

In the progressive training of VideoQA-SC, the numbers of training epochs for stage 1, stage 2, stage 3 and stage 4 are 20, 20, 20 and 10, respectively. The learning rates for stage 1, stage 2, stage 3 and stage 4 are $1\times 10^{-5}$, $5\times 10^{-6}$, $5\times 10^{-6}$ and $2\times 10^{-6}$, respectively. The Gumbel-Softmax trick is enabled in training stages 3 and 4 for differentiable sampling. For stable training, we set the temperature coefficient $\tau=5$ at the beginning and decay it by a factor of 0.9 after each epoch in stage 3. Similarly, we set $\tau=1$ initially and decay it by a factor of 0.95 after each epoch in stage 4. We train models that satisfy different bandwidth constraints by tuning the hyperparameter $\lambda$.
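The temperature schedule and the Gumbel-Softmax relaxation [33] can be sketched in plain Python as follows. The schedule uses the initial values, decay factors and epoch counts stated above; the rate-predictor scores are hypothetical, and the sampler is a generic Gumbel-Softmax rather than the authors' code.

```python
import math
import random


def temperature(tau0, decay, epoch):
    """Temperature used during `epoch` (0-based) with per-epoch exponential decay."""
    return tau0 * decay ** epoch


def gumbel_softmax(logits, tau, rng=random):
    """Draw one relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        g = -math.log(-math.log(u))                     # Gumbel(0, 1) noise
        noisy.append((x + g) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


stage3_taus = [temperature(5.0, 0.9, e) for e in range(20)]
stage4_taus = [temperature(1.0, 0.95, e) for e in range(10)]

# Hypothetical rate-predictor scores over the 8 candidate rates.
scores = [0.1, 0.3, 1.2, 0.0, -0.5, 0.4, 0.2, -1.0]
sample = gumbel_softmax(scores, tau=stage3_taus[-1], rng=random.Random(0))
```

A high temperature gives nearly uniform samples and smooth gradients early in training, while a lower temperature makes the samples closer to one-hot rate decisions.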

3.1.3 Comparison Schemes

We compare VideoQA-SC with traditional SSCC-based SC systems and DJSCC-based SC systems that perform VideoQA based on reconstructed videos. Specifically, we use SSCC and DJSCC schemes to transmit videos over the noisy wireless channel, and use the same optimized VideoQA model ($g(\cdot;\boldsymbol{\zeta})$ and $u(\cdot;\boldsymbol{\nu})$) to perform VideoQA tasks.

For SSCC-based SC systems, we adopt the traditional video codecs (H264/H265) for source coding and assume that the channel capacity is achievable, which gives an upper bound on performance. The SSCC comparison schemes consist of "H264 + channel capacity" and "H265 + channel capacity", which denote the combination of H264 or H265 with an optimal channel code that achieves channel capacity. FFmpeg is adopted to simulate the video coding process of H264 and H265.
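As an illustration of the capacity-achieving assumption, the sketch below converts the size of a compressed bitstream into the smallest number of channel symbols needed at a given SNR. It assumes a complex AWGN channel with Shannon capacity $\log_2(1+\text{SNR})$ bits per channel use; the bitstream size is hypothetical, and the exact accounting used in the paper's experiments may differ.

```python
import math


def awgn_capacity(snr_db):
    """Shannon capacity of a complex AWGN channel in bits per channel use."""
    snr_linear = 10 ** (snr_db / 10)
    return math.log2(1 + snr_linear)


def min_channel_uses(num_bits, snr_db):
    """Fewest channel symbols that can carry num_bits at capacity."""
    return math.ceil(num_bits / awgn_capacity(snr_db))


# Hypothetical compressed clip of 40,000 bits from an H264/H265 encoder.
bits = 40_000
symbols_0db = min_channel_uses(bits, 0)
symbols_10db = min_channel_uses(bits, 10)
```

Dividing such a symbol count by the dimension of the source video would give the bandwidth cost of the SSCC scheme, and the count rises as the SNR falls because each symbol carries fewer bits.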

For DJSCC-based SC systems, we adopt the advanced DJSCC-based video transmission model DVST [11] to transmit videos. Since DVST only focuses on the coding of P-frames, DJSCC-based image compression neural networks are required to encode I-frames. For a simple and fair comparison, we assume that I-frames can be transmitted losslessly and that DVST is used to encode P-frames based on the lossless I-frames. We only consider the average bandwidth of P-frames. The performance upper bound of the DJSCC-based SC system is then obtained through the optimal video reconstruction with the minimum bandwidth.

3.2 Ablation Study

(a) The content-adaptive method in the TGIF-QA-R Action dataset.
(b) The content-adaptive method in the TGIF-QA-R Transition dataset.
Figure 6: Answer accuracy of different versions of VideoQA-SC versus the average BCR under fixed SNRs over the AWGN channel. Each line is trained with a particular SNR.
(a) The SNR-adaptive method in the TGIF-QA-R Action dataset.
(b) The SNR-adaptive method in the TGIF-QA-R Transition dataset.
Figure 7: Answer accuracy of different versions of VideoQA-SC versus SNRs over the AWGN channel.

VideoQA-SC enables dynamic bandwidth allocation guided by multiple sources of information to make full use of limited bandwidth resources. We perform ablation experiments on two bandwidth-adaptive methods (content-adaptive and SNR-adaptive bandwidth allocation) to verify their effectiveness.

We train and test our models over AWGN channels under fixed SNRs and mixed SNRs to validate the effectiveness of the content-adaptive and SNR-adaptive methods, respectively. The mixed training SNR range is from $-5$ dB to 15 dB. Fig. 6(a) and Fig. 6(b) show the accuracy of different versions of VideoQA-SC constrained by various BCRs under three SNRs ($-5$, 0 and 10 dB). For each particular SNR, a content-adaptive VideoQA-SC is compared with a base VideoQA-SC that does not use rate predictors for bandwidth allocation.
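The mixed-SNR setting can be sketched as drawing one SNR per training step and adding complex Gaussian noise of the matching power. Uniform sampling over the range and unit average signal power are assumptions made for this sketch; the text only specifies the SNR range.

```python
import math
import random


def awgn(symbols, snr_db, rng=random):
    """Add complex Gaussian noise to unit-average-power symbols at the given SNR."""
    noise_power = 10 ** (-snr_db / 10)          # signal power assumed to be 1
    sigma = math.sqrt(noise_power / 2)          # std per real/imaginary component
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in symbols]


def sample_training_snr(low=-5.0, high=15.0, rng=random):
    """Draw one SNR in dB for a training step (uniform sampling is an assumption)."""
    return rng.uniform(low, high)


rng = random.Random(0)
x = [complex(1 / math.sqrt(2), 1 / math.sqrt(2))] * 4   # unit-power symbols
snr_db = sample_training_snr(rng=rng)
y = awgn(x, snr_db, rng=rng)
```

For the fixed-SNR models, the same `awgn` function would simply be called with a constant `snr_db` during both training and testing.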

It can be observed that under different BCR constraints, the accuracy of the content-adaptive VideoQA-SC is consistently higher than that of the base VideoQA-SC. As the BCR decreases, the accuracy gain of the content-adaptive VideoQA-SC gradually increases. This indicates that when the available bandwidth is extremely constrained, content-adaptive bandwidth allocation can effectively use the limited bandwidth resources to improve the overall performance of VideoQA tasks. Moreover, content-adaptive bandwidth allocation does not consider channel conditions; it allocates bandwidth to different tokens based only on video semantics. In scenarios with high SNRs, e.g., 10 dB, the accuracy gains from content-adaptive bandwidth allocation become more pronounced because video semantics can be transmitted more accurately with less channel noise. In particular, when the SNR is 10 dB and the BCR is constrained to $2.4\times 10^{-5}$ (an average of 4 retained channels per token), the accuracy gain is about 1.89% for the content-adaptive VideoQA-SC in the TGIF-QA-R Action dataset.

By incorporating the estimated SNR into the rate predictors, VideoQA-SC can jointly consider video semantics and current channel conditions when making decisions, which is called SNR-adaptive bandwidth allocation. Fig. 7(a) and Fig. 7(b) illustrate the different bandwidth allocations of SNR-adaptive VideoQA-SC under various SNRs. For readability, the average BCR for each point on the lines representing SNR-adaptive VideoQA-SCs is normalized by $10^{-5}$ and marked near that point.

The lines representing SNR-adaptive VideoQA-SCs show a basic trend: as the SNR decreases, the SNR-adaptive VideoQA-SC tends to allocate more bandwidth for transmitting videos to resist stronger channel noise. Furthermore, two fixed-bandwidth base VideoQA-SCs are compared with SNR-adaptive VideoQA-SCs trained with different trade-off parameters $\lambda$. We train the two fixed-bandwidth VideoQA-SCs based on the highest and lowest bandwidths allocated by the SNR-adaptive VideoQA-SCs under different SNRs. As the SNR decreases from 5 dB to $-5$ dB, the base VideoQA-SCs experience rapid performance degradation due to the fixed bandwidth allocation. In contrast, the SNR-adaptive VideoQA-SC is able to adjust the bandwidth allocation policy according to the current SNR, thereby achieving smooth performance degradation. In particular, when the SNR is $-5$ dB, the SNR-adaptive VideoQA-SC achieves an accuracy gain of up to 2.10% compared with the base VideoQA-SC in the TGIF-QA-R Action dataset.

Another observation is that the SNR-adaptive VideoQA-SC not only dynamically adjusts the bandwidth allocation according to the SNR, but also outperforms the base VideoQA-SC with the same bandwidth in most cases. Fig. 7(b) shows that the SNR-adaptive VideoQA-SC outperforms even the base VideoQA-SC with the highest fixed bandwidth under all SNRs in the TGIF-QA-R Transition dataset. This suggests that the SNR-adaptive VideoQA-SC is still capable of content-adaptive bandwidth allocation based on video semantics, which shows the scalability of the proposed learning-based bandwidth allocation. By integrating SNR with video semantics for bandwidth allocation, the SNR-adaptive VideoQA-SC achieves promising task performance while avoiding the need to train multiple models for specific channel conditions and to switch models frequently in practical applications.

3.3 Performance Comparison of Different SC Models

As depicted in Figs. 8 and 9, the proposed VideoQA-SC is compared with SSCC-based and DJSCC-based SC systems over the AWGN channel under different BCR constraints and SNRs. We use the optimized VideoQA model ($g(\cdot;\boldsymbol{\zeta})$ and $u(\cdot;\boldsymbol{\nu})$) with lossless video and text to execute VideoQA tasks as the upper bound on the performance of VideoQA-SC. The lower bound on the performance of VideoQA-SC is obtained by performing VideoQA tasks with text information only. For a fair comparison, the same VideoQA model optimized in training stage 1 is adopted in all comparison SC schemes. Content-adaptive and SNR-adaptive VideoQA-SCs are trained and tested under fixed SNRs and mixed SNRs, respectively.

(a) SNR = 0 dB.
(b) SNR = 5 dB.
(c) SNR = 10 dB.
Figure 8: Answer accuracy of different SC models versus average BCR over AWGN channels under different SNRs in the TGIF-QA-R Action dataset.
(a) SNR = 0 dB.
(b) SNR = 5 dB.
(c) SNR = 10 dB.
Figure 9: Answer accuracy of different SC models versus average BCR over AWGN channels under different SNRs in the TGIF-QA-R Transition dataset.
Figure 10: Answer accuracy of VideoQA-SC over Rayleigh fading channels with different fading parameters $\sigma_{h}$.

By comparing Fig. 8(a), 8(b) and 8(c), or Fig. 9(a), 9(b) and 9(c), it can be seen that the proposed VideoQA-SC outperforms the other SC schemes, especially at low SNRs (0 dB). Since the comparison SC systems need to restore the original video at the pixel level, they all have a turning point at the necessary transmission bandwidth, with a corresponding minimum BCR $B_{\text{min}}$. When the BCR constraint satisfies $B<B_{\text{min}}$, the comparison SC systems lack the information needed to reconstruct the original videos, which leads to a rapid deterioration of VideoQA performance. As the SNR decreases, $B_{\text{min}}$ gradually increases, which indicates that more bandwidth is required to guarantee the basic performance of SC systems when the channel conditions are worse. The DJSCC-based video transmission model DVST also employs a bandwidth-adaptive approach and achieves excellent video reconstruction performance, e.g., in terms of peak signal-to-noise ratio at the same bandwidth. However, DJSCC-based SC systems achieve only slight bandwidth savings compared with SSCC-based SC systems for VideoQA tasks. This indicates that the adopted entropy-based bandwidth allocation, which is limited to pixel-level video reconstruction, is not effective for SC systems oriented to a particular intelligent task. Besides, although DJSCC-based SC systems can avoid the cliff effect and are robust to channel noise, they still face the risk of system collapse when $B\geq B_{\text{min}}$ cannot be satisfied.

Unlike the comparison schemes, which operate at the pixel level, the proposed VideoQA-SC extracts and processes video semantics at the object and frame levels, and directly performs VideoQA tasks based on the reconstructed video semantics at the receiver, which results in significant bandwidth savings with guaranteed VideoQA performance. In this way, the minimum BCR $B_{\text{min}}$ of VideoQA-SC can be very low, which ensures the basic performance of VideoQA-SC under a wide range of BCR constraints. Both types of VideoQA-SC achieve accuracies of over 53% and 64% under all tested SNR and BCR constraints in the TGIF-QA-R Action and Transition datasets, respectively. Benefiting from E2E joint training and SNR-adaptive bandwidth allocation, the proposed VideoQA-SC demonstrates robustness to noisy wireless channels while approaching the upper bound performance. In Fig. 8(a), the accuracy of the SNR-adaptive VideoQA-SC exceeds that of the DJSCC-based SC system by 5.17% with almost $\frac{1}{200}$ of its bandwidth.

We further show the performance of VideoQA-SC over Rayleigh block fading channels with different fading parameters $\sigma_{h}$. For each $\hat{\mathbf{s}}_{v}^{c}$, a corresponding $h$ is sampled from the Rayleigh distribution $\operatorname{Rayleigh}(\sigma_{h})$. We adjust the signal power so that its ratio to the simulated average noise power is 1. Then, the statistical SNR and the current SNR are fed into the rate predictors to improve the robustness of the SNR-adaptive VideoQA-SC to Rayleigh block fading channels. Fig. 10 shows that VideoQA-SC achieves a slow and smooth degradation of VideoQA performance as the fading increases, whereas the DJSCC-based SC system fails to resist noise in Rayleigh block fading channels.
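A Rayleigh block fading channel of the kind described above can be sketched as follows: one gain $h$ is drawn per block, applied to every symbol in that block, and complex Gaussian noise of unit average power is added. Treating $h$ as a real-valued magnitude gain and the specific helper names are assumptions made for this sketch.

```python
import math
import random


def rayleigh_sample(sigma_h, rng=random):
    """Draw h ~ Rayleigh(sigma_h) by inverse-transform sampling."""
    u = 1.0 - rng.random()                      # u in (0, 1], avoids log(0)
    return sigma_h * math.sqrt(-2.0 * math.log(u))


def block_fading_channel(symbols, sigma_h, noise_power=1.0, rng=random):
    """Apply one fading gain h to the whole block, then add complex Gaussian noise."""
    h = rayleigh_sample(sigma_h, rng)
    sigma_n = math.sqrt(noise_power / 2)        # std per real/imaginary component
    received = [h * s + complex(rng.gauss(0, sigma_n), rng.gauss(0, sigma_n))
                for s in symbols]
    return received, h


rng = random.Random(0)
block = [complex(1 / math.sqrt(2), -1 / math.sqrt(2))] * 8   # unit-power symbols
y, h = block_fading_channel(block, sigma_h=1.0, rng=rng)
# With signal and noise power both equal to 1, the current SNR is h^2.
current_snr_db = 10 * math.log10(h ** 2)
```

Under these assumptions the current SNR varies from block to block with $h^2$, which is the kind of per-block quantity, alongside the statistical SNR, that the rate predictors receive as input.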

Fig. 11 provides a visual example of the robustness of VideoQA-SC over the AWGN channel. DVST is adopted as the comparison scheme for video transmission. In this case, every 4 frames are combined into one clip for semantic extraction and video coding. The first frame is used as the I-frame, which serves as the reference for coding the P-frames in DVST.

When the channel SNR is 10 dB, both VideoQA-SC and DVST successfully transmit the video and help to predict the correct answer. As the SNR drops to 0 dB, DVST makes the same bandwidth allocation decision as when the SNR is 10 dB and fails to predict the correct answer. In contrast, because it uses the estimated SNR as a prior to aid bandwidth allocation, VideoQA-SC is able to dynamically adjust the number of transmission symbols allocated to each frame, ultimately overcoming the channel noise and predicting the correct answer. Although VideoQA-SC allocates more transmission symbols on average to each frame under a low SNR, this number is still much lower than the number of transmission symbols allocated by DVST.

Figure 11: A visualization of the robustness of VideoQA-SC to channel noise. Each row represents the bandwidth allocation made by the corresponding model for all video frames over the AWGN channel with a specific SNR. Each column denotes the number of complex transmission symbols allocated for the corresponding frame in different cases. The correct answer is marked in red.

4 Conclusion

In this paper, we have proposed an E2E multimodal SC system, VideoQA-SC, to perform VideoQA tasks in wireless networks without relying on reconstruction of the original source. Taking advantage of efficient video semantic extraction and learning-based bandwidth-adaptive DJSCC transmission, VideoQA-SC is able to fully leverage video semantic information to improve system performance for VideoQA tasks with high bandwidth efficiency and noise robustness. VideoQA-SC is trained in an E2E manner with the goal of maximizing the VideoQA performance under bandwidth constraints. Experiments show that the proposed VideoQA-SC outperforms traditional SSCC-based and advanced DJSCC-based SC systems under a wide range of channel conditions and bandwidth constraints. In particular, VideoQA-SC improves VideoQA accuracy by 5.17% while achieving nearly 99.5% bandwidth savings compared with the DJSCC-based SC system over the AWGN channel when the SNR is 0 dB, which demonstrates the great potential of task-oriented SC system design for video applications.

References

  • [1] Jiawei Shao, Yuyi Mao, and Jun Zhang. Task-Oriented Communication for Multidevice Cooperative Edge Inference. IEEE Transactions on Wireless Communications, 22(1):73–87, 2023.
  • [2] Mikolaj Jankowski, Deniz Gündüz, and Krystian Mikolajczyk. Wireless Image Retrieval at the Edge. IEEE Journal on Selected Areas in Communications, 39(1):89–100, 2021.
  • [3] Xuewen Luo, Hsiao-Hwa Chen, and Qing Guo. Semantic Communications: Overview, Open Issues, and Future Research Directions. IEEE Wireless Communications, 29(1):210–219, 2022.
  • [4] Hongyang Du, Jiacheng Wang, Dusit Niyato, Jiawen Kang, Zehui Xiong, Mohsen Guizani, and Dong In Kim. Rethinking Wireless Communication Security in Semantic Internet of Things. IEEE Wireless Communications, 30(3):36–43, 2023.
  • [5] Zi Qin Liew, Yanyu Cheng, Wei Yang Bryan Lim, Dusit Niyato, Chunyan Miao, and Sumei Sun. Economics of Semantic Communication System in Wireless Powered Internet of Things. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8637–8641, 2022.
  • [6] Jiawei Su, Zhixin Liu, Yuan-ai Xie, Kai Ma, Hongyang Du, Jiawen Kang, and Dusit Niyato. Semantic Communication-Based Dynamic Resource Allocation in D2D Vehicular Networks. IEEE Transactions on Vehicular Technology, 72(8):10784–10796, 2023.
  • [7] Le Xia, Yao Sun, Dusit Niyato, Daquan Feng, Lei Feng, and Muhammad Ali Imran. xURLLC-Aware Service Provisioning in Vehicular Networks: A Semantic Communication Perspective. IEEE Transactions on Wireless Communications, pages 1–1, 2023.
  • [8] Marios Kountouris and Nikolaos Pappas. Semantics-Empowered Communication for Networked Intelligent Systems. IEEE Communications Magazine, 59(6):96–102, 2021.
  • [9] Zhenzi Weng and Zhijin Qin. Semantic Communication Systems for Speech Transmission. IEEE Journal on Selected Areas in Communications, 39(8):2434–2444, 2021.
  • [10] Jincheng Dai, Sixian Wang, Kailin Tan, Zhongwei Si, Xiaoqi Qin, Kai Niu, and Ping Zhang. Nonlinear Transform Source-Channel Coding for Semantic Communications. IEEE Journal on Selected Areas in Communications, 40(8):2300–2316, 2022.
  • [11] Sixian Wang, Jincheng Dai, Zijian Liang, Kai Niu, Zhongwei Si, Chao Dong, Xiaoqi Qin, and Ping Zhang. Wireless Deep Video Semantic Transmission. IEEE Journal on Selected Areas in Communications, 41(1):214–229, 2023.
  • [12] Xiang Peng, Zhijin Qin, Danlan Huang, Xiaoming Tao, Jianhua Lu, Guangyi Liu, and Chengkang Pan. A Robust Deep Learning Enabled Semantic Communication System for Text. In GLOBECOM 2022 - 2022 IEEE Global Communications Conference, pages 2704–2709, 2022.
  • [13] Shengshi Yao, Kai Niu, Sixian Wang, and Jincheng Dai. Semantic Coding for Text Transmission: An Iterative Design. IEEE Transactions on Cognitive Communications and Networking, 8(4):1594–1603, 2022.
  • [14] Tianxiao Han, Qianqian Yang, Zhiguo Shi, Shibo He, and Zhaoyang Zhang. Semantic-Preserved Communication System for Highly Efficient Speech Transmission. IEEE Journal on Selected Areas in Communications, 41(1):245–259, 2023.
  • [15] Zhenzi Weng, Zhijin Qin, Xiaoming Tao, Chengkang Pan, Guangyi Liu, and Geoffrey Ye Li. Deep Learning Enabled Semantic Communications With Speech Recognition and Synthesis. IEEE Transactions on Wireless Communications, 22(9):6227–6240, 2023.
  • [16] Jialong Xu, Bo Ai, Wei Chen, Ang Yang, Peng Sun, and Miguel Rodrigues. Wireless Image Transmission Using Deep Source Channel Coding With Attention Modules. IEEE Transactions on Circuits and Systems for Video Technology, 32(4):2315–2328, 2022.
  • [17] Jialong Xu, Bo Ai, Wei Chen, Ning Wang, and Miguel Rodrigues. Deep Joint Source-Channel Coding for Image Transmission With Visual Protection. IEEE Transactions on Cognitive Communications and Networking, 9(6):1399–1411, 2023.
  • [18] Jialong Xu, Tze-Yang Tung, Bo Ai, Wei Chen, Yuxuan Sun, and Deniz Gündüz. Deep Joint Source-Channel Coding for Semantic Communications. IEEE Communications Magazine, 61(11):42–48, 2023.
  • [19] Tze-Yang Tung and Deniz Gündüz. DeepWiVe: Deep-Learning-Aided Wireless Video Transmission. IEEE Journal on Selected Areas in Communications, 40(9):2570–2583, 2022.
  • [20] Jiawei Shao, Xinjie Zhang, and Jun Zhang. Task-Oriented Communication for Edge Video Analytics. IEEE Transactions on Wireless Communications, pages 1–1, 2023.
  • [21] Haopeng Li, Haonan Tong, Sihua Wang, Nuocheng Yang, Zhaohui Yang, and Changchuan Yin. Video Semantic Communication with Major Object Extraction and Contextual Video Encoding, 2024.
  • [22] Peiwen Jiang, Chao-Kai Wen, Shi Jin, and Geoffrey Ye Li. Wireless Semantic Communications for Video Conferencing. IEEE Journal on Selected Areas in Communications, 41(1):230–244, 2023.
  • [23] Chengsi Liang, Xiangyi Deng, Yao Sun, Runze Cheng, Le Xia, Dusit Niyato, and Muhammad Ali Imran. VISTA: Video Transmission over A Semantic Communication Approach. In 2023 IEEE International Conference on Communications Workshops (ICC Workshops), pages 1777–1782, 2023.
  • [24] Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Weihong Deng, and Tat-Seng Chua. Video Question Answering: Datasets, Algorithms and Challenges. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
  • [25] Yicong Li, Xun Yang, An Zhang, Chun Feng, Xiang Wang, and Tat-Seng Chua. Redundancy-aware Transformer for Video Question Answering. Association for Computing Machinery, 2023.
  • [26] Junbin Xiao, Pan Zhou, Angela Yao, Yicong Li, Richang Hong, Shuicheng Yan, and Tat-Seng Chua. Contrastive Video Question Answering via Video Graph Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13265–13280, 2023.
  • [27] Junbin Xiao, Angela Yao, Zhiyuan Liu, Yicong Li, Wei Ji, and Tat-Seng Chua. Video as Conditional Graph Hierarchy for Multi-Granular Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2804–2812, 2022.
  • [28] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. In Advances in Neural Information Processing Systems, volume 35, pages 124–141, 2022.
  • [29] Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-Chained Image-Language Model for Video Localization and Question Answering. In Advances in Neural Information Processing Systems, volume 36, pages 76749–76771, 2023.
  • [30] Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, and Tat-Seng Chua. Transformer-Empowered Invariant Grounding for Video Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–12, 2023.
  • [31] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [32] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance Estimation for Neural Network Pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [33] Eric Jang, Shixiang Gu, and Ben Poole. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
  • [34] Liang Peng, Shuangji Yang, Yi Bin, and Guoqing Wang. Progressive Graph Attention Network for Video Question Answering. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2871–2879, 2021.