License: CC BY 4.0
arXiv:2401.12264v2 [eess.AS] 21 Feb 2024

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Xianghu Yue*, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li

*This work was done when Xianghu Yue was an intern at ByteDance.
Abstract

There has been a long-standing quest for a unified audio-visual-text model that enables various multimodal understanding tasks by mimicking the listening, seeing and reading processes of human beings. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder, which learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder, which handles textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative of the corresponding text. Additionally, to leverage the correspondences of audio and vision with language, we establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: a contrastive loss, a matching loss and a language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound. The results demonstrate the effectiveness and superiority of the proposed model for multimodal processing.

Index Terms:
Multi-modal pretrain, representation learning, contrastive learning

I Introduction

Humans learn by reading, seeing and listening, a typical multimodal process involving text, visual and audio content, and use the acquired knowledge to understand and interact with the world. Multimodal processing [1, 2, 3, 4, 5], which aims to learn general knowledge across multiple modalities, has attracted much attention recently, especially with the success of pre-training. Due to the high complexity and training cost of multimodal models, most works focus on two modalities, such as text and vision or text and audio. Just as humans benefit from tri-modal content and the interactions between the modalities, multimodal understanding tasks rely on effective tri-modal modeling.

In recent years, pre-training has developed rapidly in multimodal processing, especially for two modalities. For example, Visual-Language (VL) pre-training models [6, 7, 5, 8, 9, 10] have shown superior performance on various text-video downstream tasks, such as text-video retrieval and video captioning. Similarly, Audio-Language (AL) pre-training models, like CLAP [11] and LAION [12], learn audio representations via contrastive learning [5] by combining audio data with natural language descriptions. Building on these bi-modal pre-training methods, we take one step further and learn general knowledge across the three modalities of our daily perception: audio, visual and text.

Building a unified audio-visual-text model capable of solving various multimodal understanding tasks is a long-standing challenge for multimodal processing research. Some recent works [1, 3, 13, 14] attempt to incorporate the audio modality into VL pre-training for tri-modal understanding. However, a common approach in these efforts is to use three separate encoders for audio, vision and text, and then train them with pair-wise contrastive pre-text tasks. Although effective, these methods ignore the inherent alignment between the audio and video modalities. We note that audio and video are two naturally time-aligned and closely related modalities of human perception [15, 16], offering different but complementary information. With separate, modality-dependent encoders, the synchronization information between audio and video may not be harnessed. A few recent works [17] employ a dual-stream model (one stream being an audio-visual encoder and the other a text encoder) and train it solely with an audiovisual-text contrastive loss for coarse-grained alignment. However, the gap between the three distinct modalities remains, and the inherent bi-modal correspondences (e.g., audio-text and visual-text) are not fully exploited.

When compared to machines, the human brain has an extraordinary ability to perceive and process multimodal information [18, 19]. The human cognitive process is therefore a useful reference for machine multimodal representation learning. According to the dual coding theory [18], illustrated in Figure 1, human cognition is unique in that it has become specialized for dealing simultaneously with language and with non-verbal objects and events. The theory assumes that there are two cognitive subsystems, one specialized for the representation and processing of non-verbal objects and events (i.e., auditory and imagery), and the other specialized for dealing with language. Moreover, the dual coding theory identifies two types of connections: the representational connection, which represents the direct activation of verbal and non-verbal representations, and the referential connection, which represents the activation of the verbal system by the non-verbal system or vice versa.

Figure 1: The dual coding theory of human cognition proposed by Paivio [18].

Inspired by human cognition mechanisms, this study proposes a cognition-inspired unified audio-visual-text pre-training model, namely Correlated Audio-Visual-Text pre-training (CoAVT), to learn multimodal representations for solving various multimodal understanding tasks. CoAVT aims to build connections between different modalities and learn rich multimodal representations. Specifically, a joint audio-visual encoder handles audio and visual input simultaneously, thus leveraging the natural alignment between audio and video, and a text encoder handles textual input. The joint audio-visual encoder is designed for non-verbal information, while the text encoder is for verbal information, similar to the two cognitive subsystems of humans. To mitigate the modality gap, we propose a query encoder as a bridge between the joint audio-visual encoder and the text encoder. The query encoder contains a set of trainable query embeddings that interact with the joint audio-visual encoder and extract the audiovisual representation most informative of the corresponding text. Furthermore, to leverage the correspondences of audio and vision with language, we build audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, besides the simple contrastive loss for coarse-grained alignment, we jointly optimize CoAVT with a matching loss for fine-grained alignment and a causal language modeling loss for contextual coherence.

Extensive experiments are conducted on multiple downstream tasks, including text-video retrieval under different experimental setups (i.e., zero-shot and fine-tuned), audio-visual event classification and audio-visual retrieval. The results demonstrate that our proposed CoAVT model learns better cross-modal alignment and consistently outperforms the current state of the art on the benchmark datasets of AudioCaps, AudioSet and VGGSound. On the retrieval datasets, it achieves an average improvement of 12.4% in Recall@10 in the zero-shot setting and 1.8% in Recall@1 in the fine-tuned setting, while the improvement in classification is 2.5% mAP.

In summary, we make the following contributions:

  • We introduce CoAVT, a cognition-inspired unified pre-training model capable of solving various multimodal understanding tasks across multiple modalities, including audio, vision and text.

  • To mitigate the modality gap between the three modalities, we introduce the query encoder as a bridge for effective audiovisual-text alignment learning.

  • To fully exploit the inherent bi-modal correspondences, we build bi-modal audio-text and visual-text alignments upon the foundational audiovisual-text tri-modal alignment, which explicitly correlate the three modalities.

  • Our model achieves state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.

The remainder of this article is organized as follows. Section II reviews advances in bi-modal vision/audio-language and tri-modal audio-visual-text pre-training methods. Section III describes our proposed method. Section IV presents the experimental results and detailed analysis. The conclusion and future work are presented in Section V.

II Related work

In this section, we briefly revisit the bi-modal audio/vision-language pre-training models and tri-modal audio-visual-text pre-training models to set the stage of this work.

II-A Vision/Audio-Language Pre-training

Vision-language pre-training aims to learn bi-modal foundation models with improved performance on various vision-language downstream tasks, such as video retrieval, video question answering, and video captioning. Most of these models [7, 20, 21, 6, 5, 22] focus on visual-language alignment within the context of videos. Pioneering works such as VideoBERT [7] and CBT [20] explored the potential of joint visual-language representation via self-supervised learning. For fine-grained multimodal understanding, HERO [23] adopts a hierarchical structure to encode video and text and employs a temporal-specific pre-text task, while UniVL [24] designs a generation pre-text task. ClipBERT [22] further introduces end-to-end training by inputting sparsely sampled frames from video clips rather than densely extracted offline video features from full-length videos. BLIP-2 [10] bootstraps VL pre-training from off-the-shelf frozen pre-trained vision encoders and frozen large language models. Collectively, these works thoroughly explore the correlation between the vision and text modalities.

In a similar vein, audio-language pre-training [11, 12, 25, 26, 27] seeks to establish a profound comprehension of audio content by connecting audio and natural language. Following the success of CLIP [5], which learns image representations with natural language supervision, CLAP [11] and LAION [12] bring audio and text descriptions into a joint multimodal embedding space through contrastive learning and are pre-trained on a large number of audio-text pairs. BLAT [27] takes this a step further, bootstrapping audio-language pre-training with synthetic data. These works capture the alignment between the audio and text modalities and learn multimodal representations of them. Their efficacy has been demonstrated across various audio-language downstream tasks, such as audio retrieval, audio captioning and audio event classification.

Building on these advancements, our CoAVT takes a stride forward, aiming to learn multimodal representation among three modalities, including audio, visual and text.

II-B Audio-Visual-Text Pre-training

With the success of pre-training on two modalities, some recent works [1, 13, 28] try to incorporate the audio modality into existing VL pre-training paradigms to achieve tri-modal understanding. For example, CLIP4VLA [1] and AudioCLIP [13] extend the vision-language model CLIP [5] to accommodate the audio modality for vision-language-audio multimodal processing. Based on these, VALOR [3] employs three separate encoders for audio, video, and text and pre-trains with a contrastive loss and a language modeling loss. VATT [2] introduces a hierarchical contrastive loss for text-video and video-audio alignment, but it targets learning single-modality representations instead of improving cross-modality capability. Different from the above methods, AVR [17] designs a dual-stream model, one stream being an audio-visual encoder and the other a text encoder, which is trained with a simple contrastive loss specifically for the video retrieval task.

Our study builds upon these previous works but introduces two major differences in our dual-stream model. Firstly, to better handle the modality gap and enhance the referential connections between modalities, we employ a query encoder between the joint audio-visual encoder and the text encoder to extract the audiovisual features most informative of the text. Secondly, the AVR model focuses only on the representational connections and ignores the correspondences of audio and vision with language, while our CoAVT simultaneously exploits their intrinsic correlation for rich multimodal representation learning.

Figure 2: Overview of our proposed CoAVT model, which consists of a joint audio-visual encoder, a text encoder and a query encoder containing a set of learnable query embeddings. The query encoder shares parameters with the text encoder except for the cross-attention layers. The red dashed box shows the pre-training objectives of our CoAVT, which are calculated on three pair-wise losses: AV-T, A-T, and V-T. Each pair-wise loss consists of a contrastive loss, a matching loss and a language modeling loss.

III Methodology

In this section, we first introduce the constituent modules of our CoAVT model and then present the pre-training objectives in detail. Given a batch of videos and their corresponding descriptions, we first extract the audio from the videos and denote the audio batch, video batch and text batch as $A$, $V$, and $T$, respectively. The goal of our CoAVT model is to effectively capture and exploit the underlying relationships between different modalities (e.g., audio, video and text) and ultimately to learn rich semantic representations among all three modalities. In this way, audio, video and text with similar semantics are embedded close to each other despite being in different modalities, while those with different semantics are pushed far apart in the multimodal space.

With the multimodal representation fully learned during pre-training, we then fine-tune the model on different downstream tasks including cross-modal retrieval and multimodal event classification to verify the effectiveness.

III-A Model Architecture

Figure 2 presents an overview of our CoAVT model, which consists of a joint audio-visual encoder to handle audio and visual signals simultaneously, a text encoder to handle the textual data, and a query encoder to extract audio-visual information that is most informative of the text, thus bridging the modality gap between the three heterogeneous modalities. Details for each component are as follows.

III-A1 Joint Audio-Visual Encoder

Audio and video are two naturally aligned and closely related modalities of human perception, offering different but complementary information. Inspired by previous work [16, 15, 17], our joint audio-visual encoder incorporates an audio-visual encoder pair, and a shallow joint encoder layer. This design is intended to encode audio-visual synchronization information together with the audio and visual content, promoting enhanced cross-modal understanding and representation learning.

Given a video and its corresponding audio, we follow the pre-processing and tokenization in AST [29] and ViT [30] for audio and image inputs, respectively. Specifically, for audio, all audio clips are first randomly cropped or padded to 10 seconds, and then 128-dimensional log Mel-filterbank features are extracted with a 25 ms Hanning window every 10 ms, which results in a $1024 \times 128$ spectrogram. We then split the spectrogram into $16 \times 16$ square patches $\mathbf{a} = [a^{1}, a^{2}, \cdots, a^{512}]$ as the input of the audio encoder. For video, we uniformly sample 10 RGB frames from each video and randomly select one frame as input during pre-training. We resize and center-crop each frame to $224 \times 224$, and then split it into $16 \times 16$ square patches $\mathbf{v} = [v^{1}, v^{2}, \cdots, v^{196}]$ as the input of the visual encoder.
The joint encoder employs a multi-stream forward pass strategy, in which we input the output of the audio encoder $E_{A}'$, the output of the visual encoder $E_{V}'$, and the concatenated audio-visual representation $[E_{A}', E_{V}']$ in three independent forward passes and obtain the final audio-only embedding $E_{A}$, visual-only embedding $E_{V}$, and joint audio-visual embedding $E_{AV}$.
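The patch counts quoted above (512 audio patches and 196 visual patches) follow from simple tiling arithmetic. The short check below is only a sketch: it assumes non-overlapping $16 \times 16$ patches and no extra special tokens, and the helper name `num_patches` is hypothetical rather than taken from the paper.

```python
def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping square patches that tile a height x width input."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Audio: 1024 time frames x 128 Mel bins, as stated in the text.
audio_tokens = num_patches(1024, 128)
# Video frame: 224 x 224 pixels after resizing and center cropping.
visual_tokens = num_patches(224, 224)
# Assumption: the concatenated stream simply places both sequences side by side.
joint_tokens = audio_tokens + visual_tokens

print(audio_tokens, visual_tokens, joint_tokens)  # 512 196 708
```

Under these assumptions, the joint forward pass operates on a sequence of 708 tokens, while the audio-only and visual-only passes see 512 and 196 tokens, respectively.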

III-A2 Text Encoder

We use BERT$_{base}$ [31] as our text encoder. Given a text input of $N$ tokens, the text encoder outputs an embedding sequence $\{t_{cls}, t_{1}, \cdots, t_{N}\}$, where $t_{i} \in \mathbb{R}^{d}$ and $t_{cls}$ is the embedding of the text [CLS] token. Following BERT, we choose $t_{cls}$ as the global text representation.

III-A3 Query Encoder

To mitigate the modality gap, we propose to use the query encoder as a bridge between the joint audio-visual encoder and the text encoder to align the three modalities. The query encoder inserts one additional cross-attention layer between the self-attention layer and the feed-forward network in each transformer block of the text encoder. We first create a fixed number of learnable query embeddings as input to the query encoder. The queries can interact with each other through the self-attention layers. More importantly, these queries can further interact with the embeddings (e.g., $E_{A}, E_{V}, E_{AV}$) from the joint audio-visual encoder through the cross-attention layers, enabling better alignment between modalities. Except for the cross-attention layers, the query encoder shares parameters with the text encoder, so the queries can additionally interact with the text through the same self-attention layers. The output of the query encoder is $Q \in \mathbb{R}^{N_{q} \times C}$, where $N_{q}$ is the number of queries and $C$ is the hidden size.
For the inputs $E_{A}, E_{V}, E_{AV}$, the outputs of the query encoder are $Q_{A}, Q_{V}, Q_{AV}$, respectively.
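To illustrate why the query encoder yields an output of fixed size $N_{q} \times C$ regardless of how many audio or visual tokens it attends to, the following is a minimal sketch of a single cross-attention step. It is a simplification under stated assumptions: one attention head, no learned key, query or value projections, and toy two-dimensional vectors. The function name `cross_attention` and all numeric values are hypothetical.

```python
import math


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention: each query attends over memory."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every memory vector.
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Weighted average of the memory vectors.
        outputs.append([sum(w * m[k] for w, m in zip(weights, memory)) for k in range(dim)])
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0]]                      # N_q = 2 toy queries, C = 2
audio_memory = [[0.2, 0.1], [0.9, 0.3], [0.4, 0.8]]     # stand-in for E_A
joint_memory = audio_memory + [[0.5, 0.5], [0.1, 0.7]]  # stand-in for E_AV

q_a = cross_attention(queries, audio_memory)
q_av = cross_attention(queries, joint_memory)
print(len(q_a), len(q_a[0]), len(q_av), len(q_av[0]))
```

Both calls return $N_{q}$ vectors of size $C$, which is what allows the same downstream losses to be applied to $Q_{A}$, $Q_{V}$ and $Q_{AV}$.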

III-B Pre-training

In this section, we introduce the pre-training objectives of our CoAVT model in detail. To conduct unified multimodal representation learning among audio, vision and text, we jointly optimize three objectives during pre-training, including contrastive loss, matching loss and language modeling loss. For each objective, we mainly consider three modality pairs including text-audio pair (T-A), text-visual pair (T-V), and text-audiovisual pair (T-AV). Details for each objective are as follows.

III-B1 Contrastive Loss

We first build the coarse-grained alignment between modality $X$ and text via $X$-text contrastive learning, where $X$ represents different modalities including audio (A), visual (V) and joint audio-visual (AV). It aims to align the embedding space of the query encoder and the text encoder by encouraging positive $X$-text pairs to have similar representations in contrast to the negative pairs.

Formally, we align the output query representation $Q_{X}$ with the text representation $t_{cls}$. Since $Q_{X}$ contains multiple output embeddings (e.g., 16), we first compute the pairwise similarity between each query output $q_{i}$ and $t_{cls}$, and then select the highest one as the similarity score:

$$s(X, T) = \max_{i \in N_{q}} g_{q}(q_{i}) \cdot g_{t}(t_{cls}) \tag{1}$$

where $g_{q}(\cdot)$ and $g_{t}(\cdot)$ are linear projections that transform the query embedding and the [CLS] embedding to a common normalized low-dimensional space. The $X$-text contrastive loss consists of two symmetric terms, one for $X$-to-text classification:

$$\mathcal{L}_{X2T} = -\log \frac{\exp(s(X_{i}, T_{i}) / \tau)}{\sum_{j=1}^{B} \exp(s(X_{i}, T_{j}) / \tau)} \tag{2}$$

and the other for text-to-$X$ classification:

$$\mathcal{L}_{T2X} = -\log \frac{\exp(s(T_{i}, X_{i}) / \tau)}{\sum_{j=1}^{B} \exp(s(T_{i}, X_{j}) / \tau)} \tag{3}$$

where $\tau$ is a learnable temperature parameter, and $B$ is the batch size. The final $X$-text contrastive loss is then denoted as $\mathcal{L}_{XTC} = \frac{1}{2}(\mathcal{L}_{X2T} + \mathcal{L}_{T2X})$.
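Eqs. (1) to (3) can be traced on a toy batch. The sketch below is illustrative only and rests on several assumptions: the vectors are taken to be already projected by $g_{q}$ and $g_{t}$ and normalized, the temperature is fixed at a hypothetical value of 0.07 rather than learned, and the per-sample losses are averaged over the batch. All function names are made up for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity(queries, t_cls):
    """Eq. (1): maximum over the queries of the dot product with the text vector."""
    return max(sum(a * b for a, b in zip(q, t_cls)) for q in queries)


def mean_diagonal_nll(rows):
    """Mean over i of -log softmax(rows[i])[i], computed stably."""
    total = 0.0
    for i, row in enumerate(rows):
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum_exp - row[i]
    return total / len(rows)


def xtc_loss(batch_queries, batch_text, tau=0.07):
    """Symmetric contrastive loss: 0.5 * (L_X2T + L_T2X), averaged over the batch."""
    size = len(batch_text)
    # sim[i][j] = s(X_i, T_j) / tau
    sim = [[similarity(batch_queries[i], batch_text[j]) / tau for j in range(size)]
           for i in range(size)]
    x2t = mean_diagonal_nll(sim)                                # Eq. (2)
    t2x = mean_diagonal_nll([list(col) for col in zip(*sim)])   # Eq. (3)
    return 0.5 * (x2t + t2x)


# Two samples, two queries each, in a two-dimensional projected space.
batch_queries = [
    [normalize([1.0, 0.0]), normalize([1.0, 1.0])],
    [normalize([0.0, 1.0]), normalize([-1.0, 1.0])],
]
batch_text = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]

aligned = xtc_loss(batch_queries, batch_text)
shuffled = xtc_loss(batch_queries, batch_text[::-1])
print(aligned < shuffled)
```

In this toy setup, pairing each sample with its own text gives a lower loss than pairing it with the other sample's text, which is the behavior the contrastive objective rewards.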

III-B2 Matching Loss

$X$-text matching (XTM) is a binary classification task in which the model is asked to predict whether an $X$-text pair is matched or unmatched, given the corresponding query features and text features. Here we employ a bi-directional self-attention mask so that all queries and text tokens can attend to each other, which enables the query encoder to effectively capture the multimodal information between the text and modality $X$ through the query embeddings. Finally, we feed each output query embedding into a two-class linear classifier to obtain a matching probability $p_{i}$, and then average the probabilities across all queries as the final matching score:

$$\mathcal{L}_{XTM} = \frac{\sum_{i=1}^{N_{q}} y \log p_{i}}{N_{q}} \tag{4}$$

where $y$ is 1 when the input $X$-text pair is matched and 0 otherwise.
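The matching head can be sketched as follows. Note the assumptions: Eq. (4) as printed carries no leading minus sign and no term for unmatched pairs, whereas this sketch uses the standard binary cross-entropy (negated, with a $(1 - y)\log(1 - p_{i})$ term), which is a choice made here and not confirmed by the text. The logits are toy values and the function names are hypothetical.

```python
import math


def match_probability(logits):
    """Softmax over two-class logits [unmatched, matched]; returns P(matched)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    return exps[1] / sum(exps)


def matching_score(query_logits):
    """Average the per-query match probabilities into a single score."""
    probs = [match_probability(logits) for logits in query_logits]
    return sum(probs) / len(probs)


def xtm_loss(query_logits, y):
    """Binary cross-entropy averaged over the N_q queries; y is 1 or 0."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for logits in query_logits:
        p = match_probability(logits)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(query_logits)


# Toy classifier outputs for N_q = 3 queries, each favoring the "matched" class.
query_logits = [[-2.0, 2.0], [-1.0, 3.0], [0.0, 1.5]]

score = matching_score(query_logits)
loss_if_matched = xtm_loss(query_logits, 1)
loss_if_unmatched = xtm_loss(query_logits, 0)
print(score > 0.5, loss_if_matched < loss_if_unmatched)
```

Because every query here favors the matched class, the loss is small when the label is 1 and large when the label is 0.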

III-B3 Language Modeling Loss

Language modeling aims to generate text conditioned on the representations of modality $X$. Since there is no direct interaction between the text encoder and the joint audio-visual encoder, the information required for generating the text must first be extracted by the query encoder and then passed to the text tokens via the self-attention layers. During this process, we employ a multimodal causal self-attention mask to control the interaction between the queries and the text: the queries can attend to each other but not to the text tokens, while each text token can attend to all queries and to its preceding tokens. This masking strategy ensures a coherent and effective flow of information from the queries to the text, enabling the generation of contextually relevant and coherent text given the modality $X$ representations. The causal language modeling loss conditioned on modality $X$ can be formulated as:

$$\mathcal{L}_{XLM} = -\sum_{i}^{L} \log p(y_{i} \mid y_{<i}, Q_{X}) \tag{5}$$

where $D$ denotes the training batch, $y_{i}$ denotes the current predicted token, and $y_{<i}$ represents the previously predicted tokens. The final pre-training objective for modality $X$ is the sum of the above three objectives:

$$\mathcal{L}_{X} = \mathcal{L}_{XTC} + \mathcal{L}_{XTM} + \mathcal{L}_{XLM} \tag{6}$$
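The two self-attention masks described for the matching and language modeling objectives can be written out explicitly. The sketch below assumes that the queries are placed before the text tokens in the sequence and that each text token may also attend to itself, as in standard causal attention; neither detail is stated in the text, and the function names are hypothetical.

```python
def bidirectional_mask(num_queries, num_text):
    """Matching objective: every position may attend to every other position."""
    size = num_queries + num_text
    return [[True] * size for _ in range(size)]


def multimodal_causal_mask(num_queries, num_text):
    """Language modeling objective.

    mask[i][j] is True when position i may attend to position j.
    Positions 0..num_queries-1 are queries; the remaining ones are text tokens.
    """
    size = num_queries + num_text
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < num_queries:
                # Queries see only other queries, never text.
                mask[i][j] = j < num_queries
            else:
                # Text sees all queries plus text up to and including itself.
                mask[i][j] = j < num_queries or j <= i
    return mask


mask = multimodal_causal_mask(num_queries=2, num_text=3)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

With two queries and three text tokens, the first two rows allow attention only to the two query columns, while the text rows form a lower-triangular pattern on top of full access to the queries.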

III-B4 Overall Pre-training Loss

Besides the foundational audiovisual-text tri-modal alignment, we further utilize the correspondences between audio and vision with language (e.g., audio-text and visual-text) to enhance the audiovisual-text alignment, thereby enabling better multimodal representation learning. Therefore, the overall pre-training objective of our CoAVT consists of three pair-wise losses based on Eq. (6):

$$\mathcal{L}_{total} = \mathcal{L}_{AV} + \mathcal{L}_{A} + \mathcal{L}_{V} \tag{7}$$

III-C Fine-tuning

To verify the effectiveness of the learned representations encompassing audio, visual and text, we further fine-tune the CoAVT model for downstream retrieval and classification tasks.

III-C1 Fine-tuning for Retrieval

Video retrieval aims to retrieve the relevant video segment given a free-form natural language query. Most existing video retrieval methods encode no audio information and therefore focus solely on matching text with the visual modality. Benefiting from its tri-modal encoding ability, our model fully explores both the visual and audio information in the video for text-to-video retrieval. Moreover, we also consider audio retrieval for downstream evaluation. During fine-tuning, we adhere to the same objectives as in pre-training.

III-C2 Fine-tuning for Audio-Visual Event Classification

Besides the retrieval task, audio-visual event classification is another challenging video understanding task, which requires a good joint audio-visual representation. To conduct classification, we apply average pooling on top of the query encoder outputs, followed by a randomly initialized linear layer. Specifically, we fine-tune the model using audio-only data (A), video-only data (V), and audio-visual data (AV) to evaluate the quality of both the single-modal and the multimodal representations.
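The classification head described above can be sketched as follows: the query encoder outputs are averaged into one vector, which is passed through a linear layer to obtain one logit per class. This is a minimal pure-Python sketch under stated assumptions; the function names, the Gaussian initialization with standard deviation 0.02, and the toy sizes are hypothetical and not taken from the paper.

```python
import random


def mean_pool(query_outputs):
    """Average a list of query vectors (num_queries x dim) into one vector."""
    n = len(query_outputs)
    dim = len(query_outputs[0])
    return [sum(q[d] for q in query_outputs) / n for d in range(dim)]


def init_linear(in_dim, num_classes, seed=0):
    """Randomly initialized weights and zero bias (assumed init scheme)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(num_classes)]
    bias = [0.0] * num_classes
    return weight, bias


def classify(query_outputs, weight, bias):
    """Return one logit per class: W @ mean_pool(queries) + b."""
    pooled = mean_pool(query_outputs)
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]


# Toy sizes: 16 queries with a small hidden size of 4 and 3 classes.
queries = [[float(i + d) for d in range(4)] for i in range(16)]
weight, bias = init_linear(in_dim=4, num_classes=3)
logits = classify(queries, weight, bias)
```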

TABLE I: Comparison with state-of-the-art video retrieval methods on the AudioCaps dataset. Modality refers to the video inputs used: A: audio spectrogram, V: video frames. *VALOR uses four datasets for pre-training: VALOR-1M, WebVid-2.5M, CC14M, and HD_VILA_10M.
Method Pre-training #Example Modality R@1 R@5 R@10
Zero-shot
AVR [17] VideoCC3M 9.4M A 8.7 - 37.7
A+V 10.6 - 45.2
CoAVT$_{Baseline}$ AudioSet 1.4M A 9.4 31.0 45.7
A+V 10.5 33.9 48.7
CoAVT$_{Vanilla}$ A 13.0 36.1 50.9
A+V 13.1 36.9 51.7
CoAVT A 14.1 39.3 54.7
A+V 14.3 41.3 57.6
Fine-tuned
Oncescu et al. 2021 - - A 24.3 - 72.1
A+V 28.1 - 79.0
CLIP4VLA [1] AudioSet+HowTo100M 2.5M A 28.4 60.9 76.2
VALOR [3] VALOR* 27.5M A 40.1 73.9 83.1
AVR [17] VideoCC3M 9.4M A 35.5 - 84.5
A+V 43.2 - 88.9
CoAVT$_{Baseline}$ AudioSet 1.4M A 33.2 67.6 80.4
A+V 38.1 75.4 87.1
CoAVT$_{Vanilla}$ A 36.9 71.5 83.5
A+V 45.0 79.1 89.4
CoAVT A 41.8 76.5 87.8
A+V 44.9 79.3 89.6

IV Experimental Results and Analysis

IV-A Datasets

To validate the proposed method, we first pre-train the CoAVT model on the large-scale multimodal dataset AudioSet [32], then fine-tune it for video retrieval, audio-visual event classification and audio-visual retrieval tasks on four datasets: AudioCaps [33], Clotho [34], AudioSet-20K [32] and VGGSound [35]. The evaluation metric is Recall@n (R@n) for the retrieval tasks. For the classification task, we use mean average precision (mAP) on AudioSet-20K and accuracy on VGGSound.

IV-A1 Pre-training Dataset

For pre-training, we use the publicly available large-scale multimodal dataset AudioSet [32]. It contains over 2 million 10-second YouTube video clips, and each clip is annotated with one or more event labels from a set of 527 distinct labels. After filtering out clips that are no longer available, we downloaded 1,450,529 audio-video-text pairs for training. To turn the discrete labels of a clip into a caption, we simply concatenate them without any prompt.
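The caption construction described above amounts to joining a clip's labels into one string. A minimal sketch is given below; the separator `", "` and the example labels are assumptions made for illustration, since the text does not specify how the labels are delimited.

```python
def labels_to_caption(labels, separator=", "):
    """Join a clip's event labels into one caption string, with no prompt.

    The separator is an assumption; the paper only states that the labels
    are concatenated.
    """
    return separator.join(labels)


# Hypothetical multi-label clip.
caption = labels_to_caption(["Speech", "Dog", "Bark"])
print(caption)
```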

IV-A2 Fine-tuning Datasets

We evaluate the pre-trained CoAVT on retrieval and classification benchmarks, including AudioCaps [33], Clotho [34], AudioSet-20K [32] and VGGSound [35].

  • AudioCaps is an audio-centric video dataset whose videos, sourced from YouTube, mainly depict event scenarios and are shorter than 10 seconds. Each training sample has one caption, while each sample in the validation and test sets has five captions. We use this dataset for the text-video retrieval task in both zero-shot and fine-tuning settings. After filtering out the videos that are no longer available, we obtain 32,747 training, 442 validation, and 753 test samples.

  • Clotho is an audio-only dataset of described sounds, which are sourced from the Freesound platform [36]. This dataset consists of a development set and an evaluation set of 2,893 and 1,045 audio samples respectively, and every audio sample is accompanied by 5 captions. For fair comparison, we follow [37] and treat each of the 5 captions per test audio as a separate query. We use this dataset to validate the generalization ability of CoAVT on text-to-audio retrieval.

  • AudioSet-20K and VGGSound  For the audio-visual event classification task, we conduct experiments on AudioSet-20K [32] and VGGSound [35]. AudioSet-20K is a subset of AudioSet-2M with a more balanced class distribution. We downloaded 18,063 training and 16,690 evaluation samples. VGGSound is a collection of 200K 10-second YouTube video clips annotated with 309 classes, and we downloaded 162,567 training and 13,483 test samples.

IV-B Experimental Settings

In this section, we provide more details of the input pre-processing, model parameters, and hyper-parameter settings during our experiments.

For the audio input, we first downsample the audio waveform to 16 kHz, then extract 128-dimensional log Mel-filterbank features with a 25 ms Hanning window and a 10 ms shift, which results in a 1024 × 128 spectrogram. For the visual input, to lower the computational overhead, we uniformly sample 10 RGB frames from each 10-second video clip (i.e., 1 FPS) and randomly select one frame as the input during training. During inference on the retrieval task, the same extraction procedure is performed, except that only the central frame is presented to the model. During inference on the audio-visual event classification task, we follow [16] and use a frame aggregation strategy, in which we average the model predictions over all frames to obtain the final prediction.
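The frame handling described above can be summarized in a short sketch: uniform sampling of 10 frames, random selection of one frame for training, the central frame for retrieval inference, and averaging of per-frame predictions for classification inference. The function names are hypothetical. Placing each timestamp at the middle of its one-second segment, and taking index `len(frames) // 2` as the "central" frame of an even-length list, are assumptions made here.

```python
import random


def sample_frame_times(duration_s=10, num_frames=10):
    """Uniformly spaced timestamps (seconds), one per equal-length segment."""
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def pick_training_frame(frames, rng):
    """Training: choose one of the sampled frames at random."""
    return rng.choice(frames)


def pick_central_frame(frames):
    """Retrieval inference: use the middle frame only (assumed index)."""
    return frames[len(frames) // 2]


def aggregate_predictions(per_frame_scores):
    """Classification inference: average class scores over all frames."""
    n = len(per_frame_scores)
    return [sum(col) / n for col in zip(*per_frame_scores)]


times = sample_frame_times()
train_time = pick_training_frame(times, random.Random(0))
central = pick_central_frame(times)
# Two frames, two classes; values are placeholders.
avg = aggregate_predictions([[0.25, 0.75], [0.5, 0.5]])
```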

By default, all Transformer encoder layers are 768-dimensional and have 12 attention heads. For the joint audio-visual encoder, the audio and visual encoders are 11-layer Transformers and the joint encoder is a single self-attention Transformer layer. We initialize the joint audio-visual encoder with CAV-MAE pre-trained weights [16]. The query encoder and the text encoder share all parameters except the cross-attention layers, and the shared parameters are initialized from the BERT$_{base}$ [31] model. We set the number of queries to 16 unless specified otherwise.

To improve training efficiency, we follow FLIP [38] and randomly mask out the audio and visual inputs during pre-training, removing a large portion of the patches. Unless specified otherwise, the masking ratios are 0.75 for audio and 0.5 for video. During fine-tuning, we do not apply masking to the audio and visual inputs.

We pre-train the model for 5 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs. We use the AdamW [39] optimizer with $\beta_{1}=0.9$, $\beta_{2}=0.98$, and a weight decay of 0.05 for both pre-training and fine-tuning. We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps. The minimum learning rate at the second stage is 1e-6. For fine-tuning on the retrieval task, the learning rate is set to 3e-5 and the model is trained for 15 epochs with a batch size of 128 on 4 NVIDIA A100 GPUs. For fine-tuning on the event classification task, the learning rate is set to 8e-5 and the model is trained for 15 epochs with a batch size of 64 on one GPU.
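To make the pre-training schedule concrete, the sketch below implements a linear warmup followed by cosine decay using the stated peak learning rate (1e-4), warmup length (2k steps) and minimum learning rate (1e-6). It is one common formulation and not necessarily the authors' exact schedule. Starting the warmup from zero is an assumption. The total step count is also our own estimate, derived from the reported sample count, batch size and number of epochs.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-4, min_lr=1e-6,
                  warmup_steps=2000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup starts from 0 (assumption).
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Rough step count implied by the text: 1,450,529 samples, batch 512, 5 epochs.
steps_per_epoch = math.ceil(1_450_529 / 512)
total_steps = 5 * steps_per_epoch

for step in (0, 1000, 2000, total_steps):
    print(step, learning_rate(step, total_steps))
```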

TABLE II: Comparison with state-of-the-art models on audio-visual event classification on the AudioSet-20K (mAP) and VGGSound (accuracy) datasets. A: audio-only, V: video-only, A+V: audio-visual fine-tuning. SL: supervised pre-training, SSL: self-supervised pre-training.
Method Pre-training AudioSet-20K (mAP) VGGSound (Acc)
A V A+V A V A+V
Audio-only Models
PANNS [40] - 27.8 - - - - -
AST [29] SL 34.7 - - - - -
SSAST [41] SSL 31.0 - - - - -
MAE-AST [42] SSL 30.6 - - - - -
Audio-MAE [43] SSL 37.1 - - - - -
Chen et al. 2020 - - - - 48.8 - -
AudioSlowFast [44] - - - - 50.1 - -
Audio-Visual Models
G-Blend* [45] - 29.1 22.1 37.8 - - -
MBT* [46] SL 31.3 27.7 43.9 52.3 51.2 64.1
CAV-MAE [16] SSL 37.7 19.8 42.0 59.5 47.0 65.5
MAViL [15] SSL 39.0 22.2 42.5 59.9 48.3 63.8
CoAVT$_{Vanilla}$ SL 38.9 18.0 45.0 60.1 46.7 66.2
CoAVT SL 42.5 20.6 44.7 60.7 48.1 66.4

IV-C Results

IV-C1 Text-to-Video Retrieval

To demonstrate the effectiveness of our proposed CoAVT model, we first evaluate it for video retrieval on the AudioCaps benchmark. During inference, we follow [8, 10]: we first select $k=128$ candidates based on the similarity scores, and then re-rank them based on their pairwise matching scores. The results are provided in Table I. We mainly compare CoAVT with three tri-modal pre-training methods: 1) CLIP4VLA [1], which incorporates audio into the vision-language pre-training framework CLIP and trains on both the AudioSet and HowTo100M datasets; 2) VALOR [3], which uses three separate encoders and trains on four large-scale datasets containing 27.5M samples in total; 3) AVR [17], which uses a dual-stream model but trains with a simple contrastive loss between audiovisual and text features on the VideoCC3M dataset. For fair comparison, we build two additional models: the Baseline model uses the same objective as AVR but is trained on our AudioSet data, while the Vanilla model employs a query encoder between the joint audiovisual encoder and the text encoder, omits the bi-modal audio-text and visual-text alignments, and is trained with the three objectives on AudioSet.
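The two-stage inference procedure described above (candidate selection by similarity, then re-ranking by matching score) can be sketched as follows. The function `retrieve` and its `match_fn` interface are hypothetical stand-ins for the contrastive and matching heads; the toy example uses $k=3$ instead of 128 so that the effect of the cut-off is visible.

```python
def retrieve(sim_scores, match_fn, k=128):
    """Rank candidates for one text query.

    sim_scores: contrastive similarity scores, one per candidate.
    match_fn:   callable(candidate_index) -> pairwise matching score
                (hypothetical interface standing in for the matching head).
    """
    # Stage 1: keep the k candidates with the highest similarity.
    order = sorted(range(len(sim_scores)),
                   key=lambda i: sim_scores[i], reverse=True)
    top_k = order[:k]
    # Stage 2: re-rank only those k candidates by their matching score.
    return sorted(top_k, key=match_fn, reverse=True)


# Toy example with 4 candidates; values are placeholders.
sims = [0.1, 0.9, 0.5, 0.7]
match = {0: 0.99, 1: 0.2, 2: 0.8, 3: 0.6}
ranking = retrieve(sims, match.get, k=3)
print(ranking)
```

In this toy case, candidate 0 has the highest matching score but is never re-ranked, because it does not survive the similarity-based selection. This is the trade-off of restricting the more expensive matching step to $k$ candidates.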

TABLE III: Results on the Clotho dataset for text-audio retrieval. Note that this dataset only contains audio, with no visual track. *VALOR uses four datasets for pre-training: VALOR-1M, WebVid-2.5M, CC14M, and HD_VILA_10M, resulting in 27.5M samples in total.
Method Pre-training Fine-tuning R@1 R@5 R@10
Oncescu et al. [37] - Clotho 9.6 - 40.1
LAION [12] Clotho + AudioCaps 12.0 31.6 43.9
AVR [17] VideoCC3M - 3.0 - 17.5
Clotho 12.6 - 45.4
VALOR [3] VALOR* Clotho 17.5 42.7 55.3
CoAVT AudioSet - 4.7 13.8 20.3
Clotho 13.7 33.8 44.9
Clotho + AudioCaps 16.4 37.9 50.0

Table I compares the video retrieval performance of different approaches on the AudioCaps corpus. The results show that the proposed CoAVT model achieves state-of-the-art performance with significant improvement over existing methods in both zero-shot and fine-tuned settings, with either audio-only or joint audiovisual representations. Compared to the strong AVR model, our Baseline model obtains similar performance under the zero-shot setting (e.g., 10.6/45.2 $\rightarrow$ 10.5/48.7 in R@1/R@10 with A+V inputs), even though AVR is pre-trained on a much larger video-text dataset [17] containing more than 9M samples. Furthermore, our Vanilla model, which employs the query encoder to address the modality gap, outperforms the AVR model (e.g., 10.6/45.2 $\rightarrow$ 13.1/51.7), which demonstrates the effectiveness of the query encoder. Finally, when our final model is further equipped with audio-text and visual-text correspondences during pre-training, we obtain the state-of-the-art retrieval results (14.3/57.6). This indicates that our model is capable of learning more effective semantic representations across the audio, vision and text modalities.

We further evaluate our model on audio-only retrieval using the Clotho dataset. As shown in Table III, our CoAVT model outperforms the AVR model in R@1 but falls short of the VALOR model. It should be noted that VALOR uses a significantly larger pre-training dataset of up to 27.5M samples, compared to the 1.4M samples used to pre-train CoAVT. Despite this difference in pre-training data size, CoAVT still demonstrates competitive performance, underscoring its efficiency and potential.

IV-C2 Audio-Visual Event Classification

To further validate whether our CoAVT model learns a good joint audio-visual representation, we conduct audio-visual event classification experiments. Table II summarizes the performance comparison on the AudioSet-20K and VGGSound datasets. We report mAP on AudioSet-20K and accuracy on VGGSound for fine-tuning with the audio (A), video (V) and joint audio-visual (A+V) representations. Our CoAVT sets new state-of-the-art performance on audio-only and audio-visual classification on both AudioSet-20K and VGGSound. As shown in the table, our Vanilla model already outperforms recent competitive models such as CAV-MAE [16] and MAViL [15] on audio-visual event classification (e.g., 42.0/42.5 $\rightarrow$ 45.0 on AudioSet-20K and 65.5/63.8 $\rightarrow$ 66.2 on VGGSound). Furthermore, when we include the audio-text and visual-text bi-modal information during pre-training, CoAVT also surpasses these models on audio-only event classification (e.g., 37.7/39.0 $\rightarrow$ 42.5 and 59.5/59.9 $\rightarrow$ 60.7). This indicates that our CoAVT model not only captures a good joint audio-visual representation, but also effectively learns an audio-only representation.

These results on video retrieval and audio-visual event classification validate the effectiveness of our CoAVT model in handling multimodal understanding tasks and highlight its potential for improving performance in both audio-visual and audio-only scenarios. The inclusion of audio-text and visual-text bi-modal correspondences during pre-training provides additional benefits, indicating that the integration of bi-modal information can lead to more robust and versatile multimodal representations.

TABLE IV: Audio-visual retrieval results on the subsets of the AudioSet and VGGSound evaluation sets.
Method Pre-training AudioSet VGGSound
R@1 R@5 R@10 R@1 R@5 R@10
Visual-to-Audio Retrieval
CAV-MAE [16] AudioSet 18.8 39.5 50.1 14.8 34.2 44.0
CoAVT$_{Vanilla}$ 27.9 55.1 65.4 30.9 61.8 72.4
CoAVT 32.1 61.5 72.4 40.9 71.2 81.8
Audio-to-Visual Retrieval
CAV-MAE [16] AudioSet 15.1 34.0 43.0 12.8 30.4 40.3
CoAVT$_{Vanilla}$ 22.7 48.3 57.6 29.4 56.3 67.5
CoAVT 33.1 63.6 73.0 36.6 66.1 78.1

IV-C3 Audio-Visual Retrieval

In this section, we further investigate whether our CoAVT model also learns a well-coordinated representation that captures audio-visual correspondence via text, using the audio-visual retrieval task. For fair comparison, we use the same subsets of the AudioSet and VGGSound evaluation sets as in [16], which include 1,725 and 1,545 audio-visual samples, respectively. Specifically, we input audio and image to the model in two independent forward passes and take the mean-pooled query encoder outputs as the audio and visual representations, respectively. We then calculate the retrieval recall at ranks 1, 5 and 10 (R@1, R@5, R@10) based on the cosine similarity between the audio and visual representations. The results are shown in Table IV. Compared to CAV-MAE, which learns the joint audio-visual representation through self-supervised learning, our model leverages additional textual information as a bridge and achieves significantly better audio-visual retrieval performance. This result indicates that our CoAVT model is not only capable of learning high-quality joint audio-visual representations, but also coordinates these representations to capture audio-visual correspondences effectively.
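The evaluation protocol above can be illustrated with a small sketch that ranks gallery items by cosine similarity and counts how often the paired item appears in the top $k$. The embeddings are toy values standing in for the mean-pooled query encoder outputs, and the function names are hypothetical. Item $i$ of the query list is assumed to be paired with item $i$ of the gallery.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, g) for g in gallery]
        ranked = sorted(range(len(gallery)),
                        key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy visual-to-audio retrieval with 3 pairs; values are placeholders.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.1], [0.2, 1.0], [1.0, 0.05]]
r1 = recall_at_k(visual, audio, k=1)
r2 = recall_at_k(visual, audio, k=2)
r3 = recall_at_k(visual, audio, k=3)
```

Swapping the two arguments gives the opposite retrieval direction (audio-to-visual) with the same function.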

TABLE V: Effects of the number of learnable queries of the query encoder during pre-training.
# query Modality R@1 R@5 R@10
8 A 11.4 33.4 47.5
A+V 12.9 35.9 50.3
16 A 13.0 36.1 50.9
A+V 13.1 36.9 51.7
32 A 10.8 31.7 46.6
A+V 10.4 33.6 49.1
TABLE VI: Effects of the audio (ma) and video (mv) masking ratios during pre-training.
Masking ratio Modality R@1 R@5 R@10
no mask A 12.4 37.4 52.5
A+V 13.3 38.8 55.4
ma=0.75 mv=0.75 A 13.5 41.0 55.9
A+V 14.3 41.1 57.3
ma=0.75 mv=0.5 A 14.1 39.3 54.7
A+V 14.3 41.3 57.6
ma=0.5 mv=0.75 A 13.1 39.3 54.5
A+V 13.7 40.0 56.0

IV-D Ablation Study

In this section, we conduct ablation experiments to further understand the contributions of different components of our CoAVT model. The experiments are conducted on the zero-shot video retrieval task using AudioCaps dataset.

IV-D1 Effect of Number of Queries

The learnable query embeddings of the query encoder play an important role in extracting the audio-visual features that are informative of the corresponding text. To investigate the effect of the number of queries, we vary it from 8 to 32 during pre-training based on our Vanilla model. The results are shown in Table V. The number of queries has a significant impact on the model's performance, and the model achieves the best performance with 16 queries. Interestingly, when the number of queries is increased from 16 to 32, the performance degrades substantially (e.g., R@1 with A+V inputs drops from 13.1 to 10.4). A possible explanation is that a larger number of queries introduces more noise and redundancy: the additional queries may capture irrelevant or redundant information, which can interfere with the learning process and hinder the model's ability to extract meaningful multimodal representations.

IV-D2 Effect of Masking Ratio

Masking is a crucial technique used during pre-training, which involves randomly hiding a portion of the input data and prompting the model to predict the masked data based on the context provided by the unmasked data. This encourages the model to learn robust and generalizable representations of the data. In our CoAVT model, we perform both audio and video masking during pre-training. To investigate the impact of masking, we conduct an ablation study by varying the masking ratios. As shown in Table VI, the model yields the best overall performance when the masking ratio is 0.75 for audio and 0.5 for video. Compared with pre-training without masking, masking not only makes pre-training more efficient but also improves the generalization ability of the model.
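As a rough illustration of the patch-removal style of masking used in the experimental settings (ratios of 0.75 for audio and 0.5 for video), the sketch below keeps a random subset of patches and discards the rest. The function name and the patch counts of 512 and 196 are hypothetical values chosen for illustration; the paper does not state them in this section.

```python
import random


def mask_patches(patches, mask_ratio, rng):
    """Drop a random `mask_ratio` fraction of patches; keep the rest in order."""
    num_masked = int(len(patches) * mask_ratio)
    num_keep = len(patches) - num_masked
    keep_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in keep_idx]


rng = random.Random(0)
audio_patches = list(range(512))   # hypothetical number of audio patches
visual_patches = list(range(196))  # hypothetical number of visual patches
kept_audio = mask_patches(audio_patches, 0.75, rng)
kept_visual = mask_patches(visual_patches, 0.5, rng)
print(len(kept_audio), len(kept_visual))
```

With these ratios the encoders process only a quarter of the audio patches and half of the visual patches, which is where the efficiency gain during pre-training comes from.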

IV-D3 Effect of Audio-Text and Visual-Text Alignment

To enhance multimodal representation learning, we introduce the audio-text and visual-text bi-modal alignments in addition to the base audiovisual-text tri-modal alignment in the CoAVT model. These alignments aim to capture the rich semantic correspondences between different modalities, thus improving the model's ability to understand and integrate multimodal information. To evaluate the contributions of the audio-text and visual-text alignments separately, we conduct an ablation study in which we remove these alignments one by one during pre-training. The results are shown in Table VII. When we remove the visual-text alignment, the overall performance drops, especially on the video-only retrieval task. This indicates that the visual-text alignment plays a crucial role in capturing the semantic correspondences between visual and textual data, which is particularly important for tasks that rely heavily on visual information. When we further remove the audio-text alignment, the retrieval performance degrades accordingly. This suggests that the audio-text alignment also contributes to the model's performance, helping to establish meaningful correspondences between audio and textual data.

Overall, these results reveal that the bi-modal correspondences help the model capture rich semantic representations between modalities, leading to better performance in downstream tasks. This underscores the importance of incorporating bi-modal alignments in multimodal representation learning models.

Figure 3: Qualitative results of video-to-text retrieval on AudioCaps.
TABLE VII: Effects of the correspondence of audio-text and visual-text.
Pre-training loss Modality R@1 R@5 R@10
$\mathcal{L}_{total}$ V 5.6 19.9 30.9
A 12.4 37.4 52.5
A+V 13.3 38.8 55.4
 $-\mathcal{L}_{V}$ V 3.0 13.7 23.7
A 13.1 37.4 52.5
A+V 13.6 37.9 53.6
 $-\mathcal{L}_{V}-\mathcal{L}_{A}$ V 3.7 14.5 24.4
A 13.0 36.1 50.9
A+V 13.1 36.9 51.7

IV-D4 Qualitative Analysis

In Figure 3, we show some qualitative results of our CoAVT and the Baseline model on the video-to-text retrieval task. Specifically, we present the query videos with their matched texts. Our CoAVT model demonstrates a stronger cross-modal video-text understanding capability than the Baseline model. For example, consider the second video in the figure. While the Baseline model returns a more generic or less accurate description, the CoAVT model returns the text “an aircraft motor”, which is a more specific and accurate description of the audiovisual content in the video. These qualitative results suggest the superiority of our CoAVT model over the Baseline model in the downstream task. This reinforces the effectiveness of our approach, which leverages bi-modal and tri-modal alignments for enhanced multimodal representation learning. The results also highlight the model's ability to capture fine-grained semantic correspondences between video and text, which is crucial for various downstream tasks.

V Conclusion

This study aimed to develop a new tri-modal audio-visual-text pre-training method for multimodal processing. The proposed CoAVT model employs a joint audio-visual encoder to handle audio and visual input simultaneously, and a text encoder for textual input. This dual design has two subsystems, one for non-verbal information (e.g., audio and video) and one for verbal information (e.g., text), similar to human cognition mechanisms. CoAVT introduces a query encoder to bridge the modality gap, thereby enabling better alignment of the representations from different modalities. It further establishes the bi-modal alignments upon the base audiovisual-text tri-modal alignment, which explicitly correlate the audio, video and text modalities. The effectiveness of the proposed pre-training method was validated on three downstream tasks: video retrieval, audio-visual event classification, and audio-visual retrieval. Extensive experimental results demonstrated the consistent superiority of CoAVT, as a unified audio-visual-text model, for multimodal understanding.

References

  • [1] L. Ruan, A. Hu, Y. Song, L. Zhang, S. Zheng, and Q. Jin, “Accommodating audio modality in clip for multimodal processing,” in Proc. AAAI Conf. Artif. Intell., 2023, pp. 9641–9649.
  • [2] H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong, “VATT: Transformers for multimodal self-supervised learning from raw video, audio and text,” in Proc. Adv. Neural Inf. Process. Syst., 2021.
  • [3] S. Chen, X. He, L. Guo, X. Zhu, W. Wang, J. Tang, and J. Liu, “VALOR: Vision-audio-language omni-perception pretraining model and dataset,” in arXiv:2304.08345, 2022.
  • [4] B. Chen, A. Rouditchenko, K. Duarte, H. Kuehne, S. Thomas, A. Boggust, R. Panda, B. Kingsbury, R. Feris, D. Harwath, M. P. James Glass, and S.-F. Chang, “Multimodal clustering networks for self-supervised learning from unlabeled videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, p. 8012–8021.
  • [5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
  • [6] L. Zhu and Y. Yang, “ActBERT: Learning global-local video-text representations,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, p. 8746–8755.
  • [7] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A joint model for video and language representation learning,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7464–7473.
  • [8] D. Li, J. Li, H. Li, J. C. Niebles, and S. C. Hoi, “Align and prompt: Video-and-language pre-training with entity prompts,” in Proc. Eur. Conf. Comput. Vis., 2022, p. 4953–4963.
  • [9] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Int. Conf. Mach. Learn., 2022.
  • [10] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. Int. Conf. Mach. Learn., 2023.
  • [11] B. Elizalde, S. Deshmukh, M. A. Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” in arXiv:2206.04769, 2022.
  • [12] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2023.
  • [13] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “AudioCLIP: Extending clip to image, text and audio,” in GCPR, 2021.
  • [14] J. Liu, X. Zhu, F. Liu, L. Guo, Z. Zhao, M. Sun, W. Wang, H. Lu, S. Zhou, J. Zhang, and J. Wang, “OPT: Omni-perception pre-trainer for cross-modal understanding and generation,” in arXiv:2107.00249, 2021.
  • [15] P.-Y. Huang, V. Sharma, H. Xu, C. Ryali, H. Fan, Y. Li, S.-W. Li, G. Ghosh, J. Malik, and C. Feichtenhofer, “Mavil: Masked audio-video learners,” in Proc. Adv. Neural Inf. Process. Syst., 2023.
  • [16] Y. Gong, A. Rouditchenko, A. H. Liu, D. Harwath, L. Karlinsky, H. Kuehne, and J. Glass, “Contrastive audio-visual masked autoencoder,” in Proc. Int. Conf. Learn. Represent., 2023.
  • [17] A. Nagrani, P. H. Seo, B. Seybold, A. Hauth, S. Manen, C. Sun, and C. Schmid, “Learning audio-video modalities from image captions,” in Proc. Eur. Conf. Comput. Vis., 2022.
  • [18] J. M. Clark and A. Paivio, “Dual coding theory and education,” vol. 3, pp. 149–210, 1991.
  • [19] A. Paivio, “Imagery and verbal processes,” 1979.
  • [20] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning video representations using contrastive bidirectional transformer,” in Proc. Int. Conf. Learn. Represent., 2019.
  • [21] H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder representations from transformers,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2019, p. 5100–5111.
  • [22] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu, “Less is more: Clipbert for video-and-language learning via sparse sampling,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7331–7341.
  • [23] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “Hero: Hierarchical encoder for video+language omni-representation pre-training,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2020, p. 2046–2065.
  • [24] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, J. Li, T. Bharti, and M. Zhou, “Univl: A unified video and language pre-training model for multimodal understanding and generation,” in arXiv:2002.06353, 2020.
  • [25] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” in arXiv:2303.17395, 2023.
  • [26] Y. Xin, D. Yang, and Y. Zou, “Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2023.
  • [27] X. Xu, Z. Zhang, Z. Zhou, P. Zhang, Z. Xie, M. Wu, and K. Q. Zhu, “BLAT: Bootstrapping language-audio pre-training based on audioset tag-guided synthetic data,” in arXiv:2303.07902, 2023.
  • [28] H. H. Wu, P. Seetharaman, K. Kumar, and J. P. Bello, “Wav2CLIP: Learning robust audio representations from clip,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 4563–4567.
  • [29] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio spectrogram transformer,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 571–575.
  • [30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent., 2020.
  • [31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019, p. 4171–4186.
  • [32] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audioset: An ontology and human-labeled dataset for audio events,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2017, pp. 776–780.
  • [33] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in NAACL, 2019, p. 119–132.
  • [34] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 736–740.
  • [35] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “VGGSound: A large-scale audio-visual dataset,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, p. 721–725.
  • [36] F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in ACMM, 2013, pp. 411–412.
  • [37] A.-M. Oncescu, A. S. Koepke, J. F. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021.
  • [38] Y. Li, H. Fan, R. Hu, C. Feichtenhofer, and K. He, “Scaling language-image pre-training via masking,” in arXiv:2212.00794, 2023.
  • [39] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Represent., 2018.
  • [40] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” p. 2880–2894, 2020.
  • [41] Y. Gong, C.-I. J. Lai, Y.-A. Chung, and J. Glass, “SSAST: Self-supervised audio spectrogram transformer,” in Proc. AAAI Conf. Artif. Intell., 2022, pp. 10 699–10 709.
  • [42] A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked autoencoding audio spectrogram transformer,” in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2022, pp. 2438–2442.
  • [43] P.-Y. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked autoencoders that listen,” in Proc. Adv. Neural Inf. Process. Syst., 2022.
  • [44] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, “Slow-fast auditory streams for audio recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 855–859.
  • [45] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?” in IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12695–12705.
  • [46] A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, “Attention bottlenecks for multimodal fusion,” in Proc. Adv. Neural Inf. Process. Syst., 2021, p. 14200–14213.