Zeyu Xie (affiliations 1, 2), Xuenan Xu (affiliation 1), Zhizheng Wu (affiliations 2, 3), Mengyue Wu (affiliation 1)

PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Abstract

Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation to obtain fine-grained temporally-aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://PicoAudio.github.io.

keywords:
audio generation, data simulation, temporal control, timestamp control, occurrence frequency control

1 Introduction

Footnote: Mengyue Wu is the corresponding author.

Recently, significant progress has been made in audio generation. With the advancement of diffusion models, we can now synthesize vivid and lifelike audio segments [1, 2, 3, 4, 5]. A single model can generate universal audio, including speech, sound effects, and music [6, 7]. Some researchers are focusing on controllability, such as text-based audio editing or style transfer [8, 9], scene control for speech and sound effects [6], attributes-driven generation [10, 11], and the generation of extended, variable-length spatial music and sound [12].

Although existing models can generate sound by following instructions, content creation applications require precise control over the timestamps and occurrence frequencies of acoustic events. Existing models overlook temporal controllability, including timestamp, interval, duration, occurrence frequency, and relations like overlap or precedence. For example, most models struggle to produce sound occurrences accurately when given text inputs like “dog barks three times” or timestamps such as “bird chirping during 4-6 seconds”. These limitations significantly affect the models’ practical use in generating temporally-controllable audio content.

We argue that the lack of precise controllability in existing audio generation models is rooted in the following two aspects. First, the deficiency of temporal control is partially due to insufficient temporally-aligned audio-text data. The commonly utilized audio-text datasets, such as AudioCaps [13] and Clotho [14], emphasize the fidelity of sound event descriptions and the linguistic sophistication of textual content, but they lack annotations pertaining to temporal aspects. In particular, in the largest audio captioning dataset AudioCaps, the phrase “xx times”, indicating frequency, appears in only $1086/56796$ ($\approx 1.9\%$) annotations. Moreover, annotations regarding timestamps are scarce. High-quality temporally-aligned audio-text data is crucial for training temporally controllable models. The more meticulously annotated the data, the better the models can learn the precise correspondence between audio outputs and temporal textual conditions, thereby achieving finer-grained control. Second, the diffusion model has limited knowledge of timestamp information. Existing diffusion-based models aim to learn the relationship between the text description and the audio events in the audio signal. Although diffusion models can understand text instructions at a high level, precise control information (e.g. “event-1 at timing-1 … and event-N at timing-N”) is not taken into consideration. This is due to the current design of diffusion models, which does not take temporal information into account.

Figure 1: Illustration of controlling timestamp / occurrence frequency of audio events by PicoAudio. It can enable precise controlling of single events or multiple events.

In this work, we propose PicoAudio, which enables Precise tImestamp and frequency COntrollability of audio events by leveraging data simulation (simulated datasets for training and evaluation are available at https://github.com/PicoAudio/PicoAudio), tailored model designs, and preprocessing with a large language model. We focus on timestamp and frequency control, while other temporal conditions (e.g., ordering and interval) can be converted into timestamps through textual reasoning, akin to transforming frequency into timestamps in our experiment. PicoAudio proposes a pipeline to simulate data with temporally-aligned annotations. The pipeline entails crawling data from the Internet, segmenting and filtering audio clips to gather high-quality audio segments, and simulating realistic audio from these segments. PicoAudio introduces tailored modules for temporal control. (a) Timestamp control is accomplished by incorporating a customized input, namely the timestamp caption. With the assistance of a large language model (LLM) [15], (b) frequency control, (c) ordering via multi-event timestamp control and (d) multi-event frequency control can be implemented, as shown in Figure 1. Beyond (a)-(d), PicoAudio can achieve arbitrary precise temporal control as long as the LLM is capable of converting the requirement into timestamp captions, which is straightforward for an LLM when prompted with simulated data. Our contributions encompass the following:

  1. A data simulation pipeline tailored specifically for temporal controllable audio generation frameworks;

  2. A timestamp and frequency controllable generation framework, enabling precise control over sound events;

  3. Achieving any temporal control by integrating LLM.

Figure 2: PicoAudio Flowchart. (Left) illustrates the simulation pipeline, wherein data is crawled from the Internet, segmented and filtered, resulting in one-occurrence segments stored in a database. Pairs of audio, timestamp captions, and frequency captions are simulated from the database. (Right) showcases the model framework. Red arrows indicate the model training process by using the simulated data. Blue arrows indicate inference based on timestamp or frequency captions, where the LLM is prompted with the simulated training data.

2 Temporal Controllable Model

To enable temporal control in audio generation, we first design a simulation pipeline that automatically acquires data and a tailored text processor to enhance audio generative models’ temporal awareness, as shown in Figure 2.

2.1 Temporally-aligned Data Simulation

Data crawling, segmentation & filtering

(1) Audio clips are crawled from the Internet using event tags as search keywords. These weakly annotated clips possess only sound event tags and may contain noise. (2) A text-to-audio grounding model [16] is employed to segment the crawled data, as it can locate the temporal occurrence of events based on input text. Each localized segment encompasses one occurrence of a sound event, such as a “2-second cow mooing” segment. For generality, we also define a burst of continuous short sounds as one occurrence, such as a burst of “keyboard typing” or “door knocking”. (3) To ensure data quality, a contrastive language-audio pretraining (CLAP) model [17] is utilized for further filtering. Thus, we obtain a substantial number of high-quality one-occurrence segments, serving as a one-occurrence database.

Simulation

We randomly select events from the database and synthesize audio by randomly assigning occurrence on-set, following the approach of Xu et al. [18]. The timestamp of occurrence is annotated based on the on-set and the duration recorded in the grounding results. A simulated pair comprises a synthesized audio and a timestamp caption formatted as “event-1 at timing-1 … and event-N at timing-N”, as well as a frequency caption formatted as “event-1 j times … and event-N k times”.
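The annotation side of this simulation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy `DATABASE` layout, the 10-second `CLIP_LENGTH`, and the helper name `simulate_pair` are hypothetical, waveform mixing is omitted, and overlapping occurrences are not prevented. The occurrence counts are drawn with the approximate 2:2:1 proportion reported in the experiment section.

```python
import random

# Hypothetical one-occurrence database: event name -> durations (s) of stored segments.
DATABASE = {
    "dog barking": [0.8, 1.2, 1.0],
    "door knocking": [2.0, 1.5],
    "cow mooing": [2.0, 2.4],
}
CLIP_LENGTH = 10.0  # assumed clip length in seconds


def simulate_pair(num_events, rng=random):
    """Pick events, assign random onsets, and build both caption types."""
    events = rng.sample(sorted(DATABASE), num_events)
    annotations, timestamp_parts, frequency_parts = {}, [], []
    for event in events:
        # 1, 2, or 3 occurrences with an approximate 2:2:1 proportion.
        count = rng.choices([1, 2, 3], weights=[2, 2, 1])[0]
        spans = []
        for _ in range(count):
            duration = rng.choice(DATABASE[event])
            onset = round(rng.uniform(0.0, CLIP_LENGTH - duration), 2)
            spans.append((onset, round(onset + duration, 2)))
        spans.sort()
        annotations[event] = spans
        timings = ", ".join(f"{on}-{off}" for on, off in spans)
        timestamp_parts.append(f"{event} at {timings}")
        frequency_parts.append(f"{event} {count} times")
    return annotations, " and ".join(timestamp_parts), " and ".join(frequency_parts)
```

Calling `simulate_pair(2, rng=random.Random(0))` returns the per-event onset/offset annotations together with a timestamp caption and a frequency caption in the two formats described above.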

2.2 Text Processor

The standard format makes rule-based transformations very straightforward. The one-hot timestamp matrix $\mathcal{O}\in\mathbb{R}^{C\times T}$ is derived from the timestamp caption, where $C$ and $T$ denote the number of sound events and the time dimension, respectively.

$$\mathcal{O}_{c,t}=\begin{cases}1, & \text{if event $c$ occurs at time $t$}\\ 0, & \text{otherwise}\end{cases} \tag{1}$$
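A rule-based conversion in the spirit of Eq. (1) might look like the sketch below. The function name `caption_to_matrix`, the 10-second clip length, and the assumption that event names never contain " and " or " at " are illustrative choices, not details given in the paper; the 40 ms frame length is taken from the experiment setup.

```python
import re

import numpy as np

FRAME_SECONDS = 0.04  # 40 ms time resolution


def caption_to_matrix(caption, event_classes, clip_seconds=10.0):
    """Convert 'event at a-b, c-d and event at e-f' into a one-hot C x T matrix."""
    num_frames = int(round(clip_seconds / FRAME_SECONDS))
    matrix = np.zeros((len(event_classes), num_frames), dtype=np.float32)
    for part in caption.split(" and "):
        event, timings = part.rsplit(" at ", 1)
        row = event_classes.index(event.strip())
        # Each timing is an 'onset-offset' pair given in seconds.
        for onset, offset in re.findall(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)", timings):
            start = int(round(float(onset) / FRAME_SECONDS))
            end = int(round(float(offset) / FRAME_SECONDS))
            matrix[row, start:end] = 1.0
    return matrix
```

For example, `caption_to_matrix("dog barking at 2-3", ["dog barking", "door knocking"])` yields a matrix of shape `(2, 250)` whose first row is active on frames 50 to 74 and whose second row is all zeros.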

LLMs demonstrate excellent performance in text processing tasks. Thanks to the LLM, the PicoAudio framework can handle various input formats, for example, transforming the input “a dog barking occurred between two and three seconds” into the timestamp caption format “dog barking at 2-3”.

The LLM also empowers PicoAudio with more capabilities, such as (1) controlling occurrence frequency by transforming “a dog barks three times” into “dog barking at 1-2, 3-4, 7-9”, and (2) ordering by transforming “door knocking then door slamming” into “door knocking at 1-4 and door slamming at 6-8”. The duration of each occurrence is inferred by the LLM based on its own knowledge as well as the examples provided. We supplied GPT-4 with $300$ examples from the training set for learning, yielding an initial transformation error rate of $3/1000$ and a refined second transformation error rate of $0/1000$. It can be observed that the transformation is straightforward for an LLM when prompted with simulated training data.
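One way to organize such an LLM-based conversion is a few-shot prompt followed by a format check, as sketched below. Everything here is an assumption for illustration: the prompt wording, the regular expression for the caption format, the retry-on-invalid-output strategy, and the `llm` argument (any callable mapping a prompt string to a reply string). The paper does not describe how the second, refined transformation is performed.

```python
import re

# One clause looks like "event at a-b, c-d"; clauses are joined by " and ".
_SPAN = r"\d+(?:\.\d+)?-\d+(?:\.\d+)?"
_CLAUSE = rf"[a-z][a-z ]*? at {_SPAN}(?:, {_SPAN})*"
_CAPTION = re.compile(rf"{_CLAUSE}(?: and {_CLAUSE})*")


def is_valid_timestamp_caption(caption):
    """Check that a string follows the timestamp caption format."""
    return _CAPTION.fullmatch(caption.strip()) is not None


def build_prompt(examples, user_text):
    """Assemble a few-shot prompt from (input, timestamp caption) pairs."""
    lines = ["Convert the description into the timestamp caption format."]
    for source, target in examples:
        lines.append(f"Input: {source}\nOutput: {target}")
    lines.append(f"Input: {user_text}\nOutput:")
    return "\n\n".join(lines)


def convert(user_text, llm, examples, max_attempts=2):
    """Query the (hypothetical) llm callable, retrying if the format is invalid."""
    prompt = build_prompt(examples, user_text)
    for _ in range(max_attempts):
        caption = llm(prompt).strip()
        if is_valid_timestamp_caption(caption):
            return caption
    raise ValueError("LLM output did not match the timestamp caption format")
```

Because the target format is rigid, malformed outputs can be detected mechanically before they reach the rule-based matrix conversion.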

PicoAudio employs a CLAP model [17] to extract event information beyond timestamps, denoted as the event embedding $\mathcal{I}$, since the timestamp caption also encompasses semantic information about sound events that can be utilized as guidance.

2.3 Audio Representation

PicoAudio employs a Variational Autoencoder (VAE) for audio representation, given the inherent difficulty in directly generating spectrograms. The VAE encoder compresses the audio spectrogram $\mathcal{A}\in\mathbb{R}^{T\times M}$ into the latent representation $\mathcal{P}\in\mathbb{R}^{T/2^{R}\times M/2^{R}\times D}$, where $T$, $M$, $R$, $D$ denote the sequence length, the number of mel bands, the compression ratio and the latent dimension, respectively. $\mathcal{P}$ is divided into two halves, representing the mean $\mathcal{P}_{\mu}$ and variance $\mathcal{P}_{\sigma}$ in the latent space.

The VAE decoder reconstructs the spectrogram $\tilde{\mathcal{A}}$ based on samples from the distribution $\tilde{\mathcal{P}}=\mathcal{P}_{\mu}+\mathcal{P}_{\sigma}\cdot\mathcal{N}(0,1)$. The vocoder following the VAE decoder converts the spectrogram back into a waveform.
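The sampling rule $\tilde{\mathcal{P}}=\mathcal{P}_{\mu}+\mathcal{P}_{\sigma}\cdot\mathcal{N}(0,1)$ can be illustrated with a short NumPy sketch. It assumes that the two halves are split along the last (latent) axis and that $\mathcal{P}_{\sigma}$ acts as a scale on unit Gaussian noise, as the formula suggests; the function name `sample_latent` is hypothetical.

```python
import numpy as np


def sample_latent(encoder_output, rng):
    """Split the encoder output into two halves and draw a reparameterized sample."""
    # Assumption: the first half of the last axis is P_mu, the second half is P_sigma.
    mean, scale = np.split(encoder_output, 2, axis=-1)
    noise = rng.standard_normal(mean.shape)
    return mean + scale * noise
```

With an encoder output of shape `(256, 16, 16)`, for instance, the sampled latent has shape `(256, 16, 8)`.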

2.4 Diffusion

PicoAudio utilizes a diffusion model to predict $\tilde{\mathcal{P}}$ based on the timestamp matrix $\mathcal{O}$ and event embedding $\mathcal{I}$, since it has demonstrated excellent capabilities in audio generation [1, 2, 3, 4, 5].

The diffusion model encompasses the forward steps that transform representation $\mathcal{P}$ into the Gaussian distribution by noise injection, followed by the reverse steps that progressively denoise. A noise schedule $\{\beta_{n}:0<\beta_{n}<\beta_{n+1}<1\}$ defines the Markov chain’s transition probabilities in the forward steps:

$$q(\mathcal{P}_{n}|\mathcal{P}_{n-1})\triangleq\mathcal{N}(\sqrt{1-\beta_{n}}\mathcal{P}_{n-1},\beta_{n}\mathbf{I}) \tag{2}$$
$$\mathcal{P}_{n}=\sqrt{\bar{\alpha}_{n}}\mathcal{P}_{0}+\sqrt{1-\bar{\alpha}_{n}}\epsilon_{n} \tag{3}$$

where $\alpha_{n}=1-\beta_{n}$, $\bar{\alpha}_{n}=\prod_{i=1}^{n}\alpha_{i}$, and $\epsilon_{n}$ follows the distribution $\epsilon\sim\mathcal{N}(0,1)$. At the last step $N$, $\mathcal{P}_{N}$ follows an isotropic Gaussian distribution. The model is trained to estimate noise based on input $\mathcal{O}$, $\mathcal{I}$ and a weight $\lambda_{n}$ related to Signal-to-Noise Ratio [19]:

$$\mathcal{L}=\sum_{n=1}^{N}\lambda_{n}\mathbb{E}_{\epsilon_{n},\mathcal{P}_{0}}||\epsilon_{n}-\epsilon_{\theta}([\mathcal{P}_{n},\mathcal{O}],\mathcal{I})|| \tag{4}$$

where $[,]$ denotes concatenation, $\mathcal{I}$ is fused by cross-attention mechanism [20], and $\epsilon_{\theta}$ denotes the estimation network which can be employed to reconstruct $\tilde{\mathcal{P}}_{0}$ from $\tilde{\mathcal{P}}_{N}\sim\mathcal{N}(0,1)$ in the reverse steps with $\bar{\tau}_{n}=1-\bar{\alpha}_{n}$:

$$\tilde{\mathcal{P}}_{n-1}=\frac{1}{\sqrt{\alpha_{n}}}\left(\tilde{\mathcal{P}}_{n}-\frac{\beta_{n}}{\sqrt{\bar{\tau}_{n}}}\epsilon_{\theta}([\tilde{\mathcal{P}}_{n},\mathcal{O}],\mathcal{I})\right)+\sqrt{\frac{\bar{\tau}_{n-1}}{\bar{\tau}_{n}}\beta_{n}}\,\epsilon \tag{5}$$
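The closed-form forward step of Eq. (3) and a single reverse step of Eq. (5) can be sketched numerically as below. The number of steps `N = 1000` and the linear schedule are assumptions (the paper does not state them), and the noise prediction `eps_pred` is passed in directly as a stand-in for the network output $\epsilon_{\theta}([\tilde{\mathcal{P}}_{n},\mathcal{O}],\mathcal{I})$.

```python
import numpy as np

N = 1000                            # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, N)  # assumed linear noise schedule, increasing in n
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)     # \bar{alpha}_n
tau_bars = 1.0 - alpha_bars         # \bar{tau}_n


def forward_sample(p0, n, rng):
    """Eq. (3): jump from P_0 to P_n in closed form (n is 1-indexed)."""
    eps = rng.standard_normal(p0.shape)
    pn = np.sqrt(alpha_bars[n - 1]) * p0 + np.sqrt(1.0 - alpha_bars[n - 1]) * eps
    return pn, eps


def reverse_step(pn, n, eps_pred, rng):
    """Eq. (5): one denoising step from P_n to P_{n-1} given the predicted noise."""
    i = n - 1
    mean = (pn - betas[i] / np.sqrt(tau_bars[i]) * eps_pred) / np.sqrt(alphas[i])
    if n == 1:
        return mean  # \bar{tau}_0 = 0, so no noise is added at the final step
    sigma = np.sqrt(tau_bars[i - 1] / tau_bars[i] * betas[i])
    return mean + sigma * rng.standard_normal(pn.shape)
```

As a sanity check on the formulas, noising a sample to step 1 and then applying the reverse step with the true noise as `eps_pred` recovers the original sample.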

3 Experiment

3.1 Data Simulation

Audio clips are crawled from Freesound (https://freesound.org/) using sound event tags as search keywords. Segmentation and filtering are conducted by a text-to-audio grounding model [21] and LAION-CLAP [17] with thresholds set to $0.5$ and $0.3$, respectively. The collection process results in a total of $636$ high-quality one-occurrence segments covering $18$ sound events. During simulation, the sound events and onset times are randomly assigned, with the proportion of $1$, $2$, and $3$ occurrences for each sound event being approximately $2:2:1$. A total of $5000$, $400$, and $200$ clips are simulated for training, single-event testing and multi-event testing, respectively.
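How the two thresholds might be applied is sketched below. This is an assumed reading, not the authors' code: it supposes that the grounding model outputs frame-level probabilities that are thresholded at 0.5 and merged into contiguous segments, and that each segment is kept when a precomputed CLAP audio-text similarity exceeds 0.3. The function names and the 40 ms frame length for the grounding output are hypothetical.

```python
import numpy as np


def probs_to_segments(frame_probs, threshold=0.5, frame_seconds=0.04):
    """Turn frame-level grounding probabilities into (onset, offset) pairs in seconds."""
    active = np.asarray(frame_probs) > threshold
    # Pad with False so that runs touching either boundary are closed properly.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    onsets, offsets = changes[0::2], changes[1::2]
    return [(on * frame_seconds, off * frame_seconds) for on, off in zip(onsets, offsets)]


def filter_segments(segments, clap_similarities, threshold=0.3):
    """Keep segments whose (precomputed) CLAP similarity exceeds the threshold."""
    return [seg for seg, sim in zip(segments, clap_similarities) if sim > threshold]
```

Each surviving segment then corresponds to one occurrence of the tagged event and can be stored in the one-occurrence database.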

Four temporal control tasks are designed: (a) single-event timestamp control using timestamp caption as input; (b) single-event frequency control using the frequency caption “xx k times” as input, which is directly fed into the baseline models. GPT-4 predicts the duration of segments and subsequently converts frequency captions into timestamp captions before feeding them into PicoAudio. (c) multi-event timestamp and (d) multi-event frequency control employ captions with multiple events.

3.2 Experiment Setup

The time resolution in the timestamp matrix is set to $40$ ms, which implies that temporal control can be achieved at a granularity of tens of milliseconds. LAION-CLAP [17] is utilized as the event embedding extractor. PicoAudio adopts a pre-trained VAE model following Liu et al. [8]. The diffusion model employs a structure similar to Ghosal et al. [5] but with fewer parameters, with attention dimensions $\{4,8,16,16\}$, block channels $\{128,256,512,512\}$, and input channels $10$ ($2$ for the timestamp matrix). A HiFi-GAN vocoder is used to transform the spectrogram back into a waveform.

PicoAudio is trained for $40$ epochs with a learning rate set to $3\times 10^{-5}$, decreasing according to a linear decay scheduler. The VAE, LAION-CLAP and HiFi-GAN vocoder are frozen during training. The AdamW optimizer is utilized. During inference, the classifier-free guidance scale is set to $3$ [22, 23].
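The two training and inference settings above can be made concrete with a small sketch. Both functions are assumptions about common practice rather than the authors' implementation: the guidance rule is the widely used form that extrapolates from the unconditional to the conditional noise prediction (other parameterizations of the scale exist), and the learning rate is assumed to decay linearly to zero over training.

```python
import numpy as np


def guided_noise(eps_cond, eps_uncond, scale=3.0):
    """Classifier-free guidance: push the prediction toward the conditional estimate."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def linear_decay_lr(step, total_steps, base_lr=3e-5):
    """Learning rate that decreases linearly from base_lr to zero (assumed endpoint)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

With `scale=1.0` the guided prediction reduces to the conditional one; larger scales weight the conditioning more strongly.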

Table 1: Evaluation results. $\text{F1}_{\text{segment}}$ / $L_{1}^{\text{freq}}$ measure the timestamp alignment / occurrence frequency error between generated audio and input conditions, respectively. FAD measures the audio quality. MOS denotes subjective metrics. Ablation study: “w/o T” indicates that the model does not utilize the timestamp matrix $\mathcal{O}$, which shares a similar framework with the baseline models. The four "Timestamp" columns and the four "Frequency" columns report results under the timestamp and occurrence frequency conditions, respectively.

| Events | Model | Timestamp: $\text{F1}_{\text{segment}}$ | Timestamp: $\text{MOS}_{\text{control}}$ | Timestamp: FAD ↓ | Timestamp: $\text{MOS}_{\text{quality}}$ | Frequency: $L_{1}^{\text{freq}}$ ↓ | Frequency: $\text{MOS}_{\text{control}}$ | Frequency: FAD ↓ | Frequency: $\text{MOS}_{\text{quality}}$ |
| Single Event | Ground Truth | 0.797 | 4.78 | 0 | 4.44 | 0.302 | 4.9 | 0 | 4.38 |
| Single Event | AudioLDM2 | 0.675 | 2.14 | 10.853 | 3.34 | 2.408 | 2.3 | 20.677 | 3.68 |
| Single Event | Amphion | 0.566 | 1.98 | 11.774 | 2.82 | 2.060 | 2.22 | 11.999 | 3.54 |
| Single Event | PicoAudio w/o T | 0.694 | 2.78 | 5.926 | 4.2 | 1.25 | 2.92 | 5.923 | 4.2 |
| Single Event | PicoAudio (Ours) | 0.783 | 4.58 | 3.175 | 4.16 | 0.537 | 4.92 | 2.295 | 4.1 |
| Multiple Events | Ground Truth | 0.787 | 4.6 | 0 | 4.38 | 0.447 | 4.68 | 0 | 4.56 |
| Multiple Events | AudioLDM2 | 0.593 | 1.82 | 10.112 | 2.36 | 2.046 | 2.14 | 18.334 | 2.3 |
| Multiple Events | Amphion | 0.520 | 2.2 | 10.979 | 2.72 | 1.851 | 2.48 | 11.769 | 3.24 |
| Multiple Events | PicoAudio w/o T | 0.614 | 2.12 | 5.218 | 3.42 | 1.216 | 2.1 | 5.215 | 3.3 |
| Multiple Events | PicoAudio (Ours) | 0.772 | 4.84 | 2.863 | 4.12 | 0.713 | 4.6 | 2.1823 | 4.38 |

3.3 Evaluation

Both subjective and objective evaluation metrics are introduced to conduct comprehensive assessments.

Subjective

Mean Opinion Score (MOS) tests are conducted from two perspectives: audio quality and temporal controllability. Audio quality considers the naturalness, distortion, and event accuracy of the generated audio. Temporal controllability evaluates the accuracy of timestamp / frequency control. For each task, $5$ audio clips from each model are rated by $10$ evaluators, and the mean score is calculated. All evaluators were screened for hearing loss, have university-level education from prestigious universities, and used designated headphones.

Objective

The Fréchet audio distance (FAD), commonly used in audio generation tasks, is utilized to assess the quality of generated audio [24]. The temporal condition in the timestamp / frequency caption is used as the ground truth for evaluation. A grounding model [21] is employed to detect the onsets and offsets of segments in generated audio. (a) For the timestamp control task, the accuracy of the detected segments is assessed by the segment F1 score [25], a commonly used metric in sound event detection. (b) For the frequency control task, accuracy is measured by the absolute difference between the specified frequency in the caption and the detected frequency in the audio. The difference is averaged over the $N$ test samples and the $C$ classes, denoted as $L_{1}^{\text{freq}}$:

$$L_{1}^{\text{freq}}=\frac{1}{N\cdot C}\sum_{n=1}^{N}\sum_{c=1}^{C}|\#\text{specified}-\#\text{detected}| \tag{6}$$
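Both objective control metrics can be sketched in a few lines. `l1_freq` follows Eq. (6) directly on $N\times C$ count arrays. `segment_f1` is a simplified, assumed version of the segment-based F1 of [25] operating on binary $C\times T$ activity matrices; the segment length of 25 frames (1 s at 40 ms per frame) is an illustrative choice, since the paper does not state the segment length used.

```python
import numpy as np


def l1_freq(specified, detected):
    """Eq. (6): mean absolute count difference over N samples and C classes."""
    specified = np.asarray(specified, dtype=float)
    detected = np.asarray(detected, dtype=float)
    return np.abs(specified - detected).mean()


def segment_f1(reference, prediction, frames_per_segment=25):
    """Segment-based F1 on binary C x T matrices (assumed 1 s segments)."""
    def pool(matrix):
        # An event is active in a segment if it is active in any of its frames.
        matrix = np.asarray(matrix) > 0
        pad = (-matrix.shape[1]) % frames_per_segment
        matrix = np.pad(matrix, ((0, 0), (0, pad)))
        return matrix.reshape(matrix.shape[0], -1, frames_per_segment).any(axis=2)

    ref, pred = pool(reference), pool(prediction)
    tp = np.sum(ref & pred)
    fp = np.sum(~ref & pred)
    fn = np.sum(ref & ~pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

In this sketch, the reference matrix would come from the caption and the prediction matrix from the grounding model's detections on the generated audio.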

Simulated audio clips in the test set are utilized as the ground truth to obtain an objective upper bound, since the grounding model cannot detect and localize audio events with $100\%$ accuracy.

4 Result

Timestamp control and frequency control are evaluated separately on both single-event and multiple-event test sets. The results are presented in Table 1. Two mainstream audio generation models, AudioLDM2 [3] and Amphion [26, 9], are employed as baselines. Both subjective and objective metrics demonstrate that PicoAudio surpasses the baseline models.

4.1 Timestamp & Occurrence Frequency Control

The timestamp-controlled audio clips generated by PicoAudio are very close to the ground truth (upper bound), demonstrating the precision of control in both single-event and multi-event tasks. PicoAudio introduces tailored modules to convert the textual timestamp information into a timestamp matrix, achieving control of timestamps in the generated audio at a time resolution of $40$ ms. Equipped with prompted GPT-4, PicoAudio demonstrates outstanding performance in the frequency error metric $L_{1}^{\text{freq}}$. Even in the presence of grounding detection omission errors, it achieves an average error of $0.537$ / $0.713$ occurrences per sound event on the single-event / multi-event tasks, respectively. Achieving $L_{1}^{\text{freq}}$ less than $1$, akin to the ground truth, implies that PicoAudio has demonstrated practicality in frequency control.

Mainstream generative baseline models, however, fall short in performance. They obtain lower $\text{F1}_{\text{segment}}$ scores and produce a frequency error of around $2$ occurrences per event, as they tend to excessively replicate events when faced with temporal conditions. Furthermore, the ablation study employs a model trained on simulated data without using the timestamp matrix $\mathcal{O}$, which shares a similar framework with the baseline models. The ablation results lie between the baseline models and PicoAudio, indicating that achieving precise control requires not only temporally-aligned audio-text data but also specific model design.

4.2 Arbitrary Temporal Control Capabilities

With the powerful text processing capabilities of LLMs, PicoAudio’s precise timestamp control capability opens up a wide range of possibilities for temporal control. For instance, for temporal interval and duration control, expressions like “dog barks three times, with a 2-second interval / duration each time” can be transformed into single-event timestamp control. For event ordering, phrases like “dog barks then gunshot” can be transformed into multi-event timestamp control. Converting temporal control requirements into the timestamp caption format is straightforward for GPT-4 after being prompted. Therefore, PicoAudio can achieve arbitrary precise temporal control, provided the requirement can be expressed as a timestamp caption.

However, due to constraints imposed by the audio sources, PicoAudio can currently exercise temporal control over only a limited number of events. Expanding the number of events and achieving comprehensive control beyond the temporal dimension are among our future research directions.

4.3 Audio Quality

Both the subjective metric $\text{MOS}_{\text{quality}}$ and the objective metric FAD demonstrate that PicoAudio outperforms the baseline models. On one hand, PicoAudio benefits from the advantage of having both the training and test sets derived from simulated data, whereas baseline models have not been trained on such data. On the other hand, as mentioned earlier, baseline models tend to excessively replicate events when confronted with temporal control, leading to significant discrepancies with the distribution of the test set. The ablation experiment demonstrates that solely employing mainstream baseline frameworks trained on simulated data yields limited improvements in audio quality. Timestamp information aids the model in better discerning the distribution of audio.

5 Conclusion

Significant progress has been made in audio generation tasks, but performance in terms of temporal control remains subpar, primarily due to the lack of datasets with fine-grained annotations and specific model designs. PicoAudio addresses this issue by acquiring data with fine-grained timestamp annotation through web crawling, segmentation, filtering and simulation. In terms of model design, PicoAudio utilizes tailored modules to handle temporal information. It converts captions into one-hot matrices, assisting the diffusion model in achieving $40$ ms level control over timestamps. In evaluations encompassing controllability and quality, PicoAudio outperforms mainstream models in both subjective and objective metrics. With the support of GPT-4’s powerful text processing capabilities, PicoAudio can achieve a variety of temporal control capabilities, including frequency control, interval control, event ordering, etc. PicoAudio is currently limited to controlling a small number of events, and addressing this limitation is a direction for our future work.

References

  • [1] F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, “Audiogen: Textually guided audio generation,” in The Eleventh International Conference on Learning Representations, 2022.
  • [2] D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y. Zou, and D. Yu, “Diffsound: Discrete diffusion model for text-to-sound generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [3] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” arXiv preprint arXiv:2308.05734, 2023.
  • [4] J. Huang, Y. Ren, R. Huang, D. Yang, Z. Ye, C. Zhang, J. Liu, X. Yin, Z. Ma, and Z. Zhao, “Make-an-audio 2: Temporal-enhanced text-to-audio generation,” arXiv preprint arXiv:2305.18474, 2023.
  • [5] D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction-tuned llm and latent diffusion model,” arXiv preprint arXiv:2304.13731, 2023.
  • [6] A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Ngan et al., “Audiobox: Unified audio generation with natural language prompts,” arXiv preprint arXiv:2312.15821, 2023.
  • [7] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu et al., “Uniaudio: An audio foundation model toward universal audio generation,” arXiv preprint arXiv:2310.00704, 2023.
  • [8] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” in International Conference on Machine Learning. PMLR, 2023, pp. 21450–21474.
  • [9] Y. Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian et al., “Audit: Audio editing by following instructions with latent diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [10] Y. Chung, J. Lee, and J. Nam, “T-foley: A controllable waveform-domain diffusion model for temporal-event-guided foley sound synthesis,” arXiv preprint arXiv:2401.09294, 2024.
  • [11] Z. Guo, J. Mao, R. Tao, L. Yan, K. Ouchi, H. Liu, and X. Wang, “Audio generation with multiple conditional diffusion model,” arXiv preprint arXiv:2308.11940, 2023.
  • [12] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” arXiv preprint arXiv:2402.04825, 2024.
  • [13] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
  • [14] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
  • [15] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [16] X. Xu, H. Dinkel, M. Wu, and K. Yu, “Text-to-audio grounding: Building correspondence between captions and sound events,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 606–610.
  • [17] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
  • [18] X. Xu, X. Xu, Z. Xie, P. Zhang, M. Wu, and K. Yu, “A detailed audio-text data simulation pipeline using single-event sounds,” arXiv preprint arXiv:2403.04594, 2024.
  • [19] T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, “Efficient diffusion training via min-snr weighting strategy,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV).   IEEE Computer Society, 2023, pp. 7407–7417.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [21] X. Xu, Z. Ma, M. Wu, and K. Yu, “Towards weakly supervised text-to-audio grounding,” arXiv preprint arXiv:2401.02584, 2024.
  • [22] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [23] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” in International Conference on Machine Learning. PMLR, 2022, pp. 16784–16804.
  • [24] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms.” in INTERSPEECH, 2019, pp. 2350–2354.
  • [25] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
  • [26] X. Zhang, L. Xue, Y. Gu, Y. Wang, H. He, C. Wang, X. Chen, Z. Fang, H. Chen, J. Zhang, T. Y. Tang, L. Zou, M. Wang, J. Han, K. Chen, H. Li, and Z. Wu, “Amphion: An open-source audio, music and speech generation toolkit,” arXiv preprint arXiv:2312.09911, 2024.