Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

Bridging Language Gaps in Audio-Text Retrieval

Abstract

Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions limits the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), which uses a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. At the same time, the approach retrieves content in seven other languages with only 10% additional language-enhanced training data, yielding promising results.

keywords:
Audio-text retrieval, Contrastive learning, CLAP, Multilingual

1 Introduction

Audio-text retrieval, requiring the search for an audio clip or a caption within a database, based on a query from another modality, has seen significant advancements and applications in recent years. The integration of audio and text has facilitated various applications such as content-based audio search [1], and multimedia information retrieval. Audio-text retrieval is also one of the tasks featured in the Detection and Classification of Acoustic Scenes and Events (DCASE) competition [2]. A widely adopted technique in this field is Contrastive Language-Audio Pretraining (CLAP) [3, 4, 5] inspired by CLIP [6, 7, 8], which has demonstrated remarkable success in learning robust representations for audio-text retrieval tasks.

One significant limitation of current audio-text retrieval systems is their focus on monolingual retrieval, typically restricted to queries in a single language such as English. While datasets with non-English captions exist, such as [9], they are small and often contain errors such as imprecise annotations. However, advances in multilingual machine translation and the growing availability of open-source tools, such as OpusMT [10] and NLLB [11], have made large-scale multilingual audio-text retrieval feasible by leveraging automatic translation for data augmentation. Research on multilingual automated audio captioning (AAC) [12] has validated the viability of this method. That solution, however, suffers from limited language scalability and lacks a comprehensive evaluation of its performance across languages.

Existing audio-text retrieval systems employ various audio encoders, each with its strengths and limitations. HTSAT [13, 14], Audio-MAE in FLAP [15] and Cacophony [16] offer promising alternatives, particularly in capturing long-range dependencies in audio sequences. However, all of these encoders struggle to model variable-length audio segments. These limitations highlight the need for novel approaches to enhance the performance and adaptability of audio encoders in multilingual audio-text retrieval systems.

To address these challenges, this paper presents two primary contributions.

  • We incorporate language enhancement (LE) into retrieval tasks, employing a multilingual text encoder. SONAR [17], which offers a comprehensive suite of speech and text encoders and decoders, is one eligible candidate. We use its text decoder to generate multilingual training data and its text encoder for multilingual text encoding, thereby bridging language gaps in the field.

  • We optimize the audio encoder through the application of CED [18] to overcome performance limitations when dealing with variable-length audio-text retrieval.

The experimental results indicate that a moderate portion of multilingual training serves as a form of data augmentation for standalone English audio-text retrieval, leading to a significant improvement in performance. We also achieve state-of-the-art (SOTA) results on widely used datasets such as AudioCaps and Clotho in English audio-text retrieval, demonstrating proficiency in retrieving content across seven additional languages.


Figure 1: The proposed multilingual audio-text retrieval framework. We first generate multilingual text descriptions of the training data using the SONAR text decoder, displayed on the left. Then we train a multilingual audio-text retrieval model based on CLAP, which can be seen on the right. Models are evaluated on test captions translated using ChatGPT.

2 Methodology

The details of the multilingual audio-text retrieval are illustrated in Figure 1. It consists of two primary components: the offline preparation of multilingual data and the model training framework.

Multilingual Data Preparation

A multilingual text translator is employed to translate the English descriptions from the training set into seven additional languages. Considering that each audio clip in the Clotho training dataset [19] is associated with multiple captions, a single one is randomly selected for each language translation. Additionally, each translated caption is annotated with a language prompt, such as eng, fra, deu and so forth.

During each training epoch, a subset of the multilingual descriptions is sampled from the translated text data randomly and added to the training set. These samples are then combined with the original English descriptions to form multilingual audio-text pairs using the same audio. The performance of the subset sampled at different percentages is shown in Section 4.5.
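The per-epoch sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_epoch_pairs`, the dictionary layout, and the choice to sample the same fraction independently for every language are assumptions made for this example.

```python
import random

# Language tags for the seven additional languages used in this paper.
EXTRA_LANGS = ["fra", "deu", "spa", "nld", "cat", "jpn", "zho"]


def build_epoch_pairs(english, translated, ratio, rng):
    """Return (audio_id, lang, caption) triples for one training epoch.

    english:    {audio_id: caption} holding the original English captions.
    translated: {lang: {audio_id: caption}} with one translation per clip.
    ratio:      fraction of each language's translations added this epoch
                (assumption: applied per language, e.g. 0.1 for 10%).
    """
    # Every clip always keeps its English caption.
    pairs = [(audio_id, "eng", caption) for audio_id, caption in english.items()]
    for lang, captions in translated.items():
        audio_ids = sorted(captions)
        n_pick = round(ratio * len(audio_ids))
        # Re-drawn every epoch, so different clips get translations over time.
        for audio_id in rng.sample(audio_ids, n_pick):
            pairs.append((audio_id, lang, captions[audio_id]))
    rng.shuffle(pairs)
    return pairs


# Toy usage with placeholder captions (not real data).
english = {f"clip{i}": f"caption {i}" for i in range(10)}
translated = {
    lang: {aid: f"<{lang} translation of {cap}>" for aid, cap in english.items()}
    for lang in EXTRA_LANGS
}
epoch_pairs = build_epoch_pairs(english, translated, ratio=0.1, rng=random.Random(0))
```

Each triple keeps the language tag next to the caption, so a text encoder that expects a language prompt can receive it alongside the text.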

Training Framework

The essence of the audio-text retrieval task lies in comparing the similarity between the audio and text modalities, with CLAP [3] being one of the most commonly used techniques to achieve this. It employs a bi-encoder architecture comprising an audio encoder $\text{E}_A$, a text encoder $\text{E}_T$, and a cross-modal matching module [20]. These encoders transform an audio-text pair $(\mathcal{A}, \mathcal{T})$ into an embedding pair $(e_a, e_t)$, which are subsequently linked in a joint cross-modal space using linear projections. This space is trained through contrastive learning, leveraging the (dis)similarity of audio and text pairs within a batch [14].

Similar methodologies are employed in the multilingual audio-text retrieval task. The audio-text pairs spanning multiple languages are fed into a shared text encoder that facilitates multilingual text encoding, bolstered by the addition of a language prompt. This process is termed language enhancement (LE). We also introduce the concept of mixture LE, where the audio-text pairs encompass all seven additional languages detailed in this paper. We slightly modify the text encoder $\text{E}_T$ to $\text{E}_{MT}$, denoting its adaptation for multilingual text processing:

\begin{equation}
\begin{aligned}
e_a &= \text{E}_A(\mathcal{A}), \\
e_t &= \text{E}_{MT}(\mathcal{T}), \\
a &= \text{Project}_A(e_a), \\
t &= \text{Project}_{MT}(e_t).
\end{aligned}
\tag{1}
\end{equation}

The similarity score (cosine similarity in this system) between the projected embeddings $a$ and $t$ is computed as:

\begin{equation}
s_{A \sim MT} = \frac{a \cdot t^{T}}{\|a\| \cdot \|t\|}
\tag{2}
\end{equation}

The InfoNCE loss [21] is adopted as the loss function. This contrastive training loss between the similarity scores and the ground truth labels is calculated as follows:

\begin{equation}
\begin{aligned}
\mathcal{L}_i^{A \longrightarrow MT} &= -\log \frac{\exp(s_{A \sim MT}(i,i)/\tau)}{\sum_{j=1}^{N} \exp(s_{A \sim MT}(i,j)/\tau)}, \\
\mathcal{L}_i^{MT \longrightarrow A} &= -\log \frac{\exp(s_{A \sim MT}(i,i)/\tau)}{\sum_{j=1}^{N} \exp(s_{A \sim MT}(j,i)/\tau)}, \\
\mathcal{L} &= \frac{1}{N} \sum_{i=1}^{N} \left( \mathcal{L}_i^{A \longrightarrow MT} + \mathcal{L}_i^{MT \longrightarrow A} \right),
\end{aligned}
\tag{3}
\end{equation}

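where $s_{A \sim MT}(i,j)$ is the similarity score between the $i$-th audio and the $j$-th text in a batch of $N$ pairs, and $\tau$ is a temperature hyper-parameter.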
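The cosine similarity and the symmetric contrastive loss defined above can be illustrated with a small, dependency-free sketch. It is written in plain Python for readability rather than speed, the function names are chosen for this example only, and it assumes the inputs are already the projected embeddings of one batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def neg_log_prob(scores, target, tau):
    """-log(exp(scores[target]/tau) / sum_j exp(scores[j]/tau)), computed stably."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_z - scaled[target]


def symmetric_info_nce(audio, text, tau=0.07):
    """Batch loss: mean over i of the audio-to-text plus text-to-audio terms."""
    n = len(audio)
    # sim[i][j] is the similarity between the i-th audio and the j-th text.
    sim = [[cosine(a, t) for t in text] for a in audio]
    total = 0.0
    for i in range(n):
        audio_to_text = neg_log_prob(sim[i], i, tau)  # normalise over row i
        text_to_audio = neg_log_prob([sim[j][i] for j in range(n)], i, tau)  # column i
        total += audio_to_text + text_to_audio
    return total / n


# Matching pairs on the diagonal give a loss close to zero,
# while swapping the captions makes the loss large.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With the temperature of 0.07 used in this paper, similarities are scaled by roughly 14 before the softmax, which is why the maximum is subtracted before exponentiating.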

In our work, the model architecture primarily consists of SONAR-TE (SONAR text encoder) as the text encoder $\text{E}_{MT}$ and CED as the audio encoder $\text{E}_A$.

3 Experiments

3.1 Dataset

In our experiments, we use the AudioCaps [22], Clotho [19], and WavCaps [14] datasets. AudioCaps contains about 49,000 audio samples, each lasting around 10 seconds. Each audio clip is associated with a single sentence in the training set, while in the validation and test sets each clip has five annotated sentences. Clotho consists of 6,974 audio samples, ranging from 15 to 30 seconds in length, and each audio sample is annotated with five sentences. The dataset is divided into 3,839 training samples, 1,045 validation samples, and 1,045 test samples. WavCaps is a large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. It draws on four main sources: FreeSound, BBC Sound Effects, SoundBible, and the strongly-labelled subset of AudioSet.

Furthermore, we automatically translate the training sets of AudioCaps and Clotho into seven languages for the training of multilingual audio-text retrieval, using a multilingual text translator based on the SONAR text decoder.

3.2 Models

Audio Encoder

For the audio encoder, we use the recently introduced CED-Base model [18]. CED-Base is a standard 86 M parameter vision transformer that has been trained on AudioSet [23] via knowledge distillation from a large teacher ensemble. The model uses 64-dimensional Mel-spectrograms as inputs computed from a 16 kHz signal. Then it extracts non-overlapping $16 \times 16$ patches from the Mel-spectrogram, which results in $4 \times 62 = 248$ patches over an input of 10 s. In our experiments, applying a patch dropout of 25% on both frequency and time patches yields better results while also accelerating training.
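The patch count quoted above can be reproduced with a short calculation. The hop size of 160 samples (10 ms) and the centre-padded framing are assumptions that this section does not state. They are chosen here because they yield the reported $4 \times 62$ grid. The function name is likewise hypothetical.

```python
def patch_grid(n_mels=64, sample_rate=16_000, seconds=10.0, hop=160, patch=16):
    """Return (frequency_patches, time_patches) for non-overlapping square patches.

    hop=160 and the "+ 1" frame from centre padding are assumptions,
    not values taken from the paper.
    """
    n_frames = 1 + int(seconds * sample_rate) // hop
    # Frames or Mel bins that do not fill a whole patch are dropped.
    return n_mels // patch, n_frames // patch


freq_patches, time_patches = patch_grid()
n_patches = freq_patches * time_patches  # 4 * 62 = 248 for a 10 s input
```

Under the same assumptions, the number of time patches grows linearly with the clip duration, which is the property that matters for variable-length inputs such as the 15 to 30 second Clotho clips.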

Text Encoder

The core of multilingual audio-text retrieval lies in the text encoder’s capacity to process multilingual texts. In this study, we exclusively use SONAR-TE [17]. SONAR-TE extracts a single vector bottleneck to represent the entire text, without utilizing token-level cross-attention found in standard sequence-to-sequence MT architectures. The fixed-size text representation is computed by pooling the token-level outputs of the encoder. In subsequent sections, SONAR simply represents the text encoder.

3.3 Setup

Our training data comes in two configurations, small and large: the small set contains AudioCaps and Clotho, and the large set contains WavCaps, AudioCaps, and Clotho. We use ChatGPT 3.5 to translate the captions of the AudioCaps and Clotho test sets into seven different languages: French (fra), German (deu), Spanish (spa), Dutch (nld), Catalan (cat), Japanese (jpn), and Chinese (zho). These serve as the test sets for the multilingual audio-text retrieval task.

This paper’s experiments are organized as follows. We first compare the impact of audio encoders by training on the small dataset. Next, we train various models using different LE on the small dataset, evaluating their impact on the English test sets, and simultaneously implement multilingual audio-text retrieval with the mixture LE. After pretraining on the large dataset, the models are fine-tuned on the AudioCaps and Clotho datasets, incorporating the proposed mixture LE approach.

All models are trained for 20 epochs with a batch size of 128 and a learning rate of $5 \times 10^{-5}$ using the Adam optimizer, except during fine-tuning, where a smaller learning rate of $5 \times 10^{-6}$ is used. The temperature hyperparameter $\tau$ is set to 0.07 for all settings. The source code is publicly available at https://github.com/zyyan4/ml-clap.

3.4 Evaluation metrics

In audio-text retrieval tasks, model performance is evaluated with recall at rank k (R@k). For a query, R@k is 1 if the target item appears in the top k retrieved items, and 0 otherwise. The final R@k is averaged across the dataset [14]. Furthermore, this study reports the mean average precision at rank 10 (mAP10) to offer a more comprehensive comparison of the models' performance.
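The two metrics can be sketched as below. This is an illustrative implementation under stated assumptions, not the evaluation script used for the reported numbers. In particular, when a query has several relevant items (for example five captions per audio clip), the average precision is normalised here by the smaller of the number of relevant items and the cut-off, which is one common convention.

```python
def recall_at_k(ranked, relevant, k):
    """1.0 if any relevant item is among the top-k retrieved items, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked, relevant, k=10):
    """Average precision truncated at rank k for a single query."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), k)  # normalisation is an assumption, see above
    return score / denom if denom else 0.0


def evaluate(queries, k=1):
    """queries: list of (ranked_ids, set_of_relevant_ids). Returns (R@k, mAP10)."""
    n = len(queries)
    recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / n
    map10 = sum(average_precision_at_k(ranked, rel, 10) for ranked, rel in queries) / n
    return recall, map10


# Two toy queries: the target is ranked first for one and second for the other.
toy_queries = [(["a", "b", "c"], {"a"}), (["b", "a", "c"], {"a"})]
r_at_1, map_at_10 = evaluate(toy_queries, k=1)
```

In the toy example the second query contributes nothing to R@1 but still earns partial credit in mAP10, which illustrates why the latter gives a finer-grained comparison.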

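Table 1: Comparison of SONAR-SE and CED as the audio encoder.
| Audio Encoder | AudioCaps Audio-to-Text R@1 | AudioCaps Audio-to-Text mAP10 | AudioCaps Text-to-Audio R@1 | AudioCaps Text-to-Audio mAP10 | Clotho Audio-to-Text R@1 | Clotho Audio-to-Text mAP10 | Clotho Text-to-Audio R@1 | Clotho Text-to-Audio mAP10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SONAR-SE | 0.0 | 0.0 | 14.3 | 31.7 | 0.0 | 0.0 | 12.5 | 32.9 |
| CED | 50.1 | 37.4 | 40.7 | 55.8 | 21.1 | 15.5 | 18.8 | 29.8 |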

4 Results

4.1 Audio encoder comparison

Since SONAR itself features a speech encoder (SONAR-SE), this experiment assesses whether this encoder is suited as an audio encoder for retrieval. The results in Table 1 indicate that SONAR-SE is not suitable as an audio encoder for audio-text retrieval tasks. SONAR-SE shows a strong correlation between speech and text, whereas general audio used in this work exhibits a different pattern. Therefore we use CED as our default audio encoder in the rest of the paper.

4.2 Evaluation of LE on English

In this section, we examine how the different LE variants affect English retrieval, as shown in Table 2. LE notably improves English retrieval, with absolute gains of up to about 3% in both R@1 and mAP10 for several of the added languages. Notably, R@1 for Audio-to-Text on AudioCaps improves by over 6% absolute when using LE with Catalan. Training with mixture LE also yields a marked improvement over the baseline.

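Table 2: Performance impact of LE on the (original) English test sets, where “baseline” indicates no enhancement and “mixture” denotes the proposed approach.
| LE | AudioCaps Audio-to-Text R@1 | AudioCaps Audio-to-Text mAP10 | AudioCaps Text-to-Audio R@1 | AudioCaps Text-to-Audio mAP10 | Clotho Audio-to-Text R@1 | Clotho Audio-to-Text mAP10 | Clotho Text-to-Audio R@1 | Clotho Text-to-Audio mAP10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| baseline | 50.1 | 37.4 | 40.7 | 55.8 | 21.1 | 15.5 | 18.8 | 29.8 |
| fra | 53.8 | 38.9 | 42.1 | 57.2 | 24.3 | 16.4 | 19.8 | 30.7 |
| deu | 53.1 | 39.6 | 42.3 | 57.6 | 24.3 | 16.7 | 20.1 | 30.8 |
| spa | 52.3 | 39.6 | 43.2 | 57.8 | 25.5 | 16.7 | 19.8 | 30.9 |
| nld | 52.7 | 39.5 | 42.5 | 57.6 | 25.2 | 16.7 | 19.3 | 30.5 |
| cat | 56.3 | 40.3 | 43.7 | 58.4 | 24.0 | 16.6 | 19.8 | 30.8 |
| jpn | 54.1 | 39.8 | 43.4 | 58.2 | 24.5 | 16.7 | 19.7 | 30.8 |
| zho | 52.5 | 39.6 | 42.4 | 57.4 | 23.3 | 16.4 | 19.2 | 30.6 |
| mixture | 53.8 | 39.6 | 42.4 | 57.4 | 24.6 | 16.4 | 18.9 | 29.9 |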


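Figure 2: Multilingual evaluation results on AudioCaps, where the x-axis represents the tested target language, with translations obtained by ChatGPT. The baseline model represents training on the original, English captions, whereas “proposed” represents using mixture LE. These observations are consistent with Clotho.

Table 3: A comparison between our proposed method against previous approaches on English test sets of AudioCaps and Clotho. Results in gray represent the multimodal model. For all results, higher is better and best results are highlighted in bold.
| Model | Training Type | AudioCaps Audio-to-Text R@1 | AudioCaps Audio-to-Text R@5 | AudioCaps Audio-to-Text R@10 | AudioCaps Text-to-Audio R@1 | AudioCaps Text-to-Audio R@5 | AudioCaps Text-to-Audio R@10 | Clotho Audio-to-Text R@1 | Clotho Audio-to-Text R@5 | Clotho Audio-to-Text R@10 | Clotho Text-to-Audio R@1 | Clotho Text-to-Audio R@5 | Clotho Text-to-Audio R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLAP-HTSAT [24] | Pretraining | 41.9 | 73.1 | 84.6 | 34.6 | 70.2 | 82.0 | 20.0 | 44.9 | 58.7 | 16.7 | 41.1 | 54.1 |
| LAION [25] | Pretraining | 45.8 | 80.9 | 91.6 | 36.1 | 71.8 | 83.9 | 25.7 | 51.5 | 63.4 | 18.2 | 42.5 | 54.4 |
| LAION (fusion) [25] | Pretraining | 45.8 | 80.9 | 91.6 | 35.1 | 71.5 | 83.6 | 25.7 | 51.5 | 63.4 | 18.2 | 42.5 | 54.4 |
| CNN14-BERT [14] | Pretraining | 44.6 | 76.3 | 86.2 | 34.7 | 69.1 | 82.5 | 25.9 | 52.6 | 65.8 | 21.2 | 46.4 | 59.4 |
| HTSAT-BERT [14] | Pretraining | 51.7 | 82.3 | 90.6 | 39.7 | 74.5 | 86.1 | 23.4 | 50.9 | 63.4 | 19.5 | 45.2 | 58.2 |
| HTSAT-22+GPT2 [26] | Pretraining | 42.5 | - | - | 35.6 | - | - | 22.9 | - | - | 15.7 | - | - |
| FLAP [15] | Pretraining | 51.5 | 82.5 | 92.5 | 40.4 | 74.7 | 85.0 | 21.6 | 51.2 | 63.1 | 17.4 | 41.3 | 53.7 |
| FLAP (fusion) [15] | Pretraining | 53.0 | 84.1 | 92.6 | 41.5 | 75.5 | 86.0 | 25.5 | 53.4 | 67.9 | 20.3 | 46.5 | 58.8 |
| BLAT [27] | Pretraining | 40.4 | - | 85.7 | 33.3 | - | 82.4 | 13.9 | - | 48.2 | 12.3 | - | 46.1 |
| OnePeace [28] | Pretraining | 51.0 | 81.9 | 92.0 | 42.5 | 77.5 | 88.4 | 27.1 | 52.3 | 65.4 | 22.4 | 49.0 | 62.7 |
| Cacophony [16] | Pretraining | 55.3 | 83.6 | 92.4 | 41.0 | 75.3 | 86.4 | 26.5 | 54.1 | 67.3 | 20.2 | 45.9 | 58.8 |
| CED+BERT | Pretraining | 52.0 | 84.0 | 91.3 | 39.0 | 75.3 | 87.3 | 28.0 | 55.8 | 70.4 | 23.1 | 50.0 | 64.3 |
| Proposed | Pretraining | 55.7 | 81.9 | 90.8 | 40.4 | 75.4 | 87.1 | 29.3 | 53.6 | 68.0 | 23.6 | 50.9 | 64.9 |
| CNN14-NetRVLAD [29] | Fine-tuning | 33.3 | 67.6 | 80.6 | 29.3 | 65.2 | 79.3 | 13.0 | 32.9 | 45.4 | 13.1 | 33.1 | 45.1 |
| BLAT [27] | Fine-tuning | 47.5 | - | 87.6 | 38.2 | - | 85.1 | 17.9 | - | 50.9 | 13.7 | - | 48.9 |
| CNN14-BERT [14] | Fine-tuning | 45.7 | 76.1 | 87.7 | 35.1 | 70.0 | 82.1 | 27.1 | 52.7 | 66.3 | 21.5 | 47.9 | 61.9 |
| HTSAT-BERT [14] | Fine-tuning | 54.6 | 85.2 | 92.4 | 42.2 | 76.5 | 87.1 | 26.9 | 52.6 | 64.9 | 19.7 | 45.7 | 59.4 |
| Proposed | Fine-tuning | 59.3 | 86.3 | 94.0 | 45.6 | 81.0 | 90.5 | 30.5 | 58.4 | 70.7 | 24.7 | 53.6 | 67.0 |
| Proposed + mixture LE | Fine-tuning | 60.7 | 86.9 | 94.8 | 45.9 | 81.3 | 90.2 | 30.9 | 57.5 | 70.2 | 25.0 | 53.7 | 66.6 |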

4.3 Multilingual Capabilities

Training with the mixture LE approach, evaluated on English in Section 4.2, also equips the model with multilingual audio-text retrieval capabilities. It yields improved performance on multilingual test sets compared to the baseline model trained solely on English captions, as depicted in Figure 2. Performance improves noticeably for most tested languages. However, retrieval performance for Japanese remains suboptimal, which we attribute primarily to the complexity of tokenizing Japanese in the text encoder. Future work may involve adjusting the proportion of Japanese data in the training set to rebalance the language ratios.

4.4 Comparison against previous works

In Table 3, we compare our proposed approach against previous methods for audio-text retrieval on English test sets.

During the pretraining with the large dataset, the CED model shows a significant improvement in modeling variable-length audio (Clotho test set) compared to HTSAT, utilizing the same BERT-based text encoder. Additionally, the utilization of the SONAR text encoder further enhances the audio-text retrieval performance, demonstrating superior overall average performance compared to the current SOTA models. Notably, our work outperforms previous approaches that utilized additional training data on Clotho, by a significant margin.

With only pretraining, our results on Text-to-Audio slightly underperform previous approaches in terms of R@1. However, in terms of Audio-to-Text performance, our approach largely outperforms previous attempts. Upon fine-tuning, substantial performance gains are observed, most notably on the AudioCaps test set. The mixture LE also contributes to enhanced performance during the fine-tuning phase, with most metrics on both the AudioCaps and Clotho test sets reaching the SOTA. This comparative analysis against SOTA models highlights the efficacy of the proposed approach in modeling audio and text relationships across languages. The findings underscore the potential benefits of leveraging multilingual data and advanced text encoders for developing robust audio-text retrieval systems.

4.5 Ablation studies

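Table 4: The performance of mixture LE on the English test set under different data mixing ratios.
| Mix | AudioCaps Audio-to-Text R@1 | AudioCaps Audio-to-Text mAP10 | AudioCaps Text-to-Audio R@1 | AudioCaps Text-to-Audio mAP10 | Clotho Audio-to-Text R@1 | Clotho Audio-to-Text mAP10 | Clotho Text-to-Audio R@1 | Clotho Text-to-Audio mAP10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | 53.8 | 39.6 | 42.4 | 57.4 | 24.6 | 16.4 | 18.9 | 29.9 |
| 20% | 53.3 | 40.3 | 41.8 | 57.4 | 23.2 | 16.1 | 19.8 | 31.1 |
| 30% | 52.3 | 39.6 | 42.8 | 57.8 | 23.4 | 16.0 | 19.0 | 30.0 |
| 40% | 51.5 | 39.1 | 41.4 | 56.7 | 22.7 | 15.3 | 18.2 | 29.3 |
| 50% | 51.1 | 38.5 | 41.5 | 56.5 | 22.7 | 15.3 | 18.4 | 29.7 |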

We explore how the LE mixing ratio used during training affects performance on the English test sets, as shown in Table 4. Our findings suggest using mixing ratios between 10% and 30%, and we adopt 10% in our experiments. Ratios above 30% adversely affect the model's mAP10 performance. This is primarily because the multilingual audio-text pairs reuse the same audio with different text captions, so a higher mixing ratio results in a greater number of text captions per audio. For instance, at a 40% mixing ratio, one English caption plus 40% of each of the seven additional languages equates to an average of 3.8 text captions per audio. This increased complexity poses challenges for contrastive learning. When incorporating more languages, we recommend reducing the mixing ratio to mitigate its impact on the multilingual audio-text retrieval model.
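The captions-per-audio figure quoted above follows from a one-line calculation, sketched here for each ratio in Table 4. The function name is chosen for this example only, and the formula assumes that the mixing ratio is applied to each additional language separately, which is the reading that reproduces the 3.8 captions stated for 40%.

```python
def captions_per_audio(mix_ratio, n_extra_languages=7):
    """Average caption count per clip: one English caption plus the sampled translations."""
    return 1 + n_extra_languages * mix_ratio


for ratio in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"{ratio:.0%}: {captions_per_audio(ratio):.1f} captions per audio")
```

The same formula shows why adding more languages calls for a smaller ratio: the average caption count grows with the product of the two.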

5 Conclusion

In this work, we introduce LE, a simple text augmentation approach for audio-text retrieval that aims to enable multilingual audio-text retrieval. We showcase the effectiveness of employing both single-language and mixed-language enhancement for this task. The results on the English caption test sets demonstrate significant improvements, laying a strong foundation for multilingual audio-text retrieval. Our exploration across various languages yields promising outcomes, with the incorporation of the mixture LE achieving SOTA results. This model also exhibits robust multilingual retrieval capabilities, enhancing its utility for real-world applications.

References

  • [1] E. Wold, T. Blum, D. Keislar, and J. Wheaten, “Content-based classification, search, and retrieval of audio,” IEEE MultiMedia, vol. 3, no. 3, pp. 27–36, 1996.
  • [2] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
  • [3] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [4] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 976–980.
  • [5] H.-H. Wu, P. Seetharaman, K. Kumar, and J. P. Bello, “Wav2clip: Learning robust audio representations from clip,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 4563–4567.
  • [6] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [7] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022.
  • [8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning.   PMLR, 2022, pp. 12 888–12 900.
  • [9] M. Wu, H. Dinkel, and K. Yu, “Audio caption: Listen and tell,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 830–834.
  • [10] J. Tiedemann and S. Thottingal, “Opus-mt – building open translation services for the world,” in European Association for Machine Translation Conferences/Workshops, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:221097277
  • [11] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022.
  • [12] M. Cousin, E. Labbé, and T. Pellegrini, “Multilingual audio captioning using machine translated data,” arXiv preprint arXiv:2309.07615, 2023.
  • [13] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 646–650.
  • [14] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
  • [15] C.-F. Yeh, P.-Y. Huang, V. Sharma, S.-W. Li, and G. Gosh, “Flap: Fast language-audio pre-training,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2023, pp. 1–8.
  • [16] G. Zhu and Z. Duan, “Cacophony: An improved contrastive audio-text model,” arXiv preprint arXiv:2402.06986, 2024.
  • [17] P.-A. Duquenne, H. Schwenk, and B. Sagot, “Sentence-level multimodal and language-agnostic representations,” arXiv preprint arXiv:2308.11466, 2023.
  • [18] H. Dinkel, Y. Wang, Z. Yan, J. Zhang, and Y. Wang, “Ced: Consistent ensemble distillation for audio tagging,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 291–295.
  • [19] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 736–740.
  • [20] H. Sun, Z. Yan, Y. Wang, H. Dinkel, J. Zhang, and Y. Wang, “Leveraging multi-task training and image retrieval with clap for audio captioning,” in Proc. Conf. Detection Classification Acoust. Scenes Events Challenge, 2023, pp. 1–4.
  • [21] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [22] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in North American Chapter of the Association for Computational Linguistics, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:174799768
  • [23] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 776–780.
  • [24] S. Deshmukh, B. Elizalde, and H. Wang, “Audio retrieval with wavtext5k and clap training,” arXiv preprint arXiv:2209.14275, 2022.
  • [25] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [26] B. Elizalde, S. Deshmukh, and H. Wang, “Natural language supervision for general-purpose audio representations,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 336–340.
  • [27] X. Xu, Z. Zhang, Z. Zhou, P. Zhang, Z. Xie, M. Wu, and K. Q. Zhu, “Blat: Bootstrapping language-audio pre-training based on audioset tag-guided synthetic data,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2756–2764.
  • [28] P. Wang, S. Wang, J. Lin, S. Bai, X. Zhou, J. Zhou, X. Wang, and C. Zhou, “One-peace: Exploring one general representation model toward unlimited modalities,” arXiv preprint arXiv:2305.11172, 2023.
  • [29] S. Lou, X. Xu, M. Wu, and K. Yu, “Audio-text retrieval in context,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4793–4797.