Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

Factor-Conditioned Speaking-Style Captioning

Abstract

This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which hinders the learning of speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.

keywords:
speaking-style captioning, large language models, pre-trained speech encoder, factor-conditioned captioning

1 Introduction

Understanding the speaking styles of speakers, including the speaker’s attributes and prosodic information, is important in speech applications. These applications include flexible control in speech synthesis [1, 2, 3] or voice conversion [4], understanding and controlling speaking behavior for human-like spoken dialogue systems [5, 6], and supporting other speech processing technologies, such as using auxiliary labels for recognizing non-linguistic or para-linguistic information [7, 8].

Traditional studies formulate speaking-style recognition as audio classification into pre-defined speaking-style classes such as style categories, e.g., spontaneous or infant-directed [9, 10]. However, their applicability is limited due to the difficulty of predefining diverse speaking styles in real environments. In response, recent studies have tackled speaking-style captioning, which predicts speaking styles in a free descriptive format [11, 12, 13]. This eliminates the need to predefine speaking-style classes, since detailed speaking-style information, such as intensity and detailed expressions, can be used instead. Conventional style captioning models are composed of a speech encoder, bridge network, and text decoder. The speech encoder first extracts speech features from the input audio; the bridge network then converts these features into representations suitable for input to the text decoder, which generates a speaking-style caption from the representations. Pre-trained autoregressive large language models (LLMs) are used as the text decoder to generate natural and diverse captions.

One of the challenges of speaking-style captioning is to generate captions with accurate speaking-style factors, such as gender, pitch, power, and speaking speed. However, the conventional training method based on cross-entropy with ground-truth captions makes it difficult to ensure that the captioning model acquires speaking-style knowledge. The ground-truth captions contain not only speaking-style factor terms but also syntax words, which prevents the model from focusing solely on learning the correct speaking styles. Furthermore, even in the inference step, the conventional models suffer from errors in the prediction of speaking-style factors if sampling-based decoding [14] is used. While such decoding is widely used to ensure caption diversity, it can select less probable words, potentially generating words that represent incorrect style factors.

This paper presents a simple but effective model training method named factor-conditioned captioning (FCC) that can predict speaking-style factors accurately. The key idea of FCC is to make the captioning model explicitly predict only style factors as intermediate reasoning steps before generating style captions, like the Chain-of-Thought Prompting approach [15]. The ground truth of FCC consists of two parts: a factor phrase in a fixed format that represents the speaking-style factors, e.g., “male, low pitch, high volume, normal speed”, followed by the original caption. The factor phrase can be generated by predicting the speaking-style factors from the original caption using GPT [16], or from the ground-truth factors if available. Our FCC-based training proposal forces the text decoder to consider style factors in generating captions since the decoder predicts the next word based on the preceding text. One of the advantages of FCC is that it is applicable simply by changing the output sentence in the training step, regardless of the model structure or loss function.

For decoding, we introduce greedy-then-sampling (GtS) to generate diverse captions while considering the style factors. GtS predicts the factor phrase by using a maximum likelihood criterion, followed by random sampling for the caption part. This enables the generation of captions that are diverse in expression and syntax, while the conditioning factors are determined by the hypothesis with the highest probability, which prevents the generation of incorrect style factors.

Experiments on the PromptTTS dataset [1] demonstrate that FCC significantly improves the performance of both the speaking-style captioning and individual speaking-style factor classification derived from captions compared to the conventional method. Furthermore, FCC with GtS decoding yields more diverse captions while preserving accuracy of speaking style factor recognition. Note that we conduct comprehensive comparisons of the components of the captioning model to make a strong baseline for speaking style captioning, and thus validate the superior effectiveness of the proposed methods.

2 Related work

Speaking-style captioning is often included as a task within speech understanding, enabling a single model to address multiple speech-related tasks. Several prior studies proposed speech understanding models [12, 13, 17, 18, 19, 20, 21, 22]. Though the previous studies employed various types of modules in the individual components, few have reported comparisons of the components. To rectify this omission, this study conducts comprehensive evaluations of multiple methods for each of the three components in speaking-style captioning to investigate the optimal model structure and how each component affects the final result. Note that the proposed methods, FCC and GtS, show significant improvements when applied to the best model.

3 Proposed speaking-style captioning

Let $\bm{S}$ and $\bm{I}$ be an input speech signal and an input instruction sentence that specifies the task, and let $\bm{O}$ be the corresponding output sentence. Speaking-style captioning is formulated as the problem of predicting the output sentence $\bm{O}$ from $\bm{S}$ and $\bm{I}$,

\begin{equation}
\bm{\hat{O}} = \mathop{\rm arg~max}\limits_{\bm{O}} P\left(\bm{O} \mid \bm{S}, \bm{I}; \bm{\Theta}\right), \tag{1}
\end{equation}

where $\bm{\Theta}$ is the set of captioning model parameters. The instruction sentence is fixed to learn a single task, e.g., speaking-style captioning.

3.1 Captioning model

Similar to the conventional studies [11, 22, 18, 12, 19], the captioning model consists of three blocks: speech encoder, bridge network, and text decoder, as shown in Figure 1.

The speech encoder extracts multiple sequences of speech representation vectors $\bm{Z} = [\bm{z}_1, \dots, \bm{z}_K]$ from the input speech,

\begin{equation}
\bm{Z} = \mathsf{SpeechEnc}\left(\bm{S}; \bm{\Theta}_s\right), \tag{2}
\end{equation}

where $\bm{z}_k \in \mathbb{R}^{V \times T}$ denotes the $T$-length sequence of $V$-dimensional vectors output from the $k$-th layer of the speech encoder, which consists of $K$ layers. $\bm{\Theta}_s$ denotes the set of parameters of the speech encoder that are pre-trained and frozen during the training of the captioning model. The use of multiple intermediate layers enables the model to extract different types of speech information [23, 24].

The bridge network converts the speech representations into speech embeddings suitable for input to the text decoder:

\begin{equation}
\bm{e} = \mathsf{Bridge}\left(\bm{Z}; \bm{\Theta}_b\right), \tag{3}
\end{equation}

where $\bm{e} \in \mathbb{R}^{D \times N}$ is the $N$-length sequence of $D$-dimensional vectors. $\bm{\Theta}_b$ represents the set of bridge network parameters. As the bridge network, we use a time and layer-wise Transformer (TLTR) [25] to combine multi-level information from different layers of the speech encoder into a $V$-dimensional embedding of length 1, i.e., $N$=1. Finally, the embedding is projected into the same dimension as the input of the text decoder by a fully connected layer.

The text decoder predicts the posterior probabilities of the output sentence from the speech embeddings $\bm{e}$ and the embedded input instruction sentence $\bm{v}$ as

\begin{align}
\bm{v} &= \mathsf{TextEmb}\left(\mathsf{Tokenizer}\left(\bm{I}\right); \bm{\Theta}_t\right), \tag{4} \\
P\left(\bm{O}\right) &= \mathsf{TextDec}\left(\bm{v}, \bm{e}; \bm{\Theta}_d\right), \tag{5}
\end{align}

where $\bm{v} \in \mathbb{R}^{D \times M}$ and $M$ is the embedding sequence length. $\bm{\Theta}_t$ and $\bm{\Theta}_d$ represent the sets of parameters of the text embedding layer and the text decoder, respectively. The proposed method employs an autoregressive text decoder and the corresponding text tokenizer derived from a pre-trained LLM such as LLaMA-2 [26]. Furthermore, an adapter based on low-rank adaptation (LoRA) [27] is inserted into the original decoder to adapt it to the output sentences of the training data. That is, the set of text decoder parameters is composed of the parameter sets of the adapter $\bm{\Theta}_{d_a}$ and the original LLM decoder $\bm{\Theta}_{d_o}$, i.e., $\bm{\Theta}_d = \bm{\Theta}_{d_o} \cup \bm{\Theta}_{d_a}$. The parameters $\bm{\Theta}_t$ and $\bm{\Theta}_{d_o}$ are frozen during training, while $\bm{\Theta}_{d_a}$ is updated.

Figure 1: The outline of the proposal, FCC.

3.2 Training by factor-conditioned captioning

Figure 2: Generation of the ground-truth output in FCC.

To help the captioning model learn speaking-style factors explicitly in addition to syntactic information, FCC uses an extended version of the original captions as ground truth. Each extended caption consists of two parts: the first part is a factor phrase, i.e., comma-separated terms that each represent a speaking-style factor, and the second part is the original caption. Because the text decoder is autoregressive, this output format makes the captioning model first predict the speaking-style information derived from speech and then generate the caption with explicit consideration of the speaking-style factors.

It may seem difficult to pre-define such factors, but this study avoids that difficulty by automatically generating them from the original captions alone. As shown in Figure 2, speaking-style factors are predicted by GPT [16] with a specific prompt (footnote 1), and a template is then used to create the factor phrase. The factors unspecified in the caption are regarded as normal. The template of the factor phrase is fixed during training, which makes the captioning model focus on predicting the factor information. Of course, if ground-truth style factor labels are available, they can be used instead of predictions by GPT.

Footnote 1: The prompt is “Please select the information contained in the input sentence. The options are as follows:<br>Gender: male or female<br>Pitch: low, normal, high (normal if unspecified)<br>Volume: low, normal, high (normal if unspecified)<br>Speaking Speed: slow, normal, fast (normal if unspecified)<br><br>Input: {caption}.”, where <br> represents a line break.
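As an illustration of how an FCC training target could be assembled, the following minimal sketch joins a fixed-format factor phrase and the original caption. The function names, the dictionary keys, and the exact template are assumptions made for this example; only the example phrase “male, low pitch, high volume, normal speed”, the ‘style:’ delimiter shown in Figure 1, and the rule that unspecified factors are treated as normal come from the text.

```python
# Hypothetical helpers: names, keys, and the exact template are assumptions.

def make_factor_phrase(factors):
    """Render the style factors in a fixed order and format."""
    gender = factors["gender"]
    # Factors that are missing or unspecified are regarded as "normal".
    pitch = factors.get("pitch") or "normal"
    volume = factors.get("volume") or "normal"
    speed = factors.get("speed") or "normal"
    return f"{gender}, {pitch} pitch, {volume} volume, {speed} speed"


def make_fcc_target(factors, caption, delimiter="style:"):
    """Concatenate the factor phrase, an assumed delimiter, and the caption."""
    return f"{make_factor_phrase(factors)} {delimiter} {caption}"


example = make_fcc_target(
    {"gender": "male", "pitch": "low", "volume": "high"},
    "A man speaks loudly in a deep voice.",
)
print(example)
```

Because the factor phrase always precedes the caption, an autoregressive decoder trained on such targets conditions every caption token on the factors it has already emitted.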

3.3 Greedy-then-sampling decoding

Common sampling strategies are not suitable for FCC because they risk outputting style factors with low probability, which yields caption errors. Thus, we also propose GtS, a decoding strategy for FCC that generates captions with diverse expressions. In GtS, decoding begins with a greedy search criterion and continues until a delimiter token that connects the factors and the caption (‘style:’ in Figure 1) is generated. Then, it switches to a sampling criterion [14] to generate the main speaking-style caption. The greedy search ensures the correctness of the style factors by simply using those with the highest posterior probabilities, while the subsequent sampling allows the generation of diverse captions conditioned on the predicted style factors.
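The switch from greedy search to sampling can be sketched as follows. This is a simplified illustration under several assumptions rather than the implementation used in the experiments: the model is replaced by a toy function that returns a next-token distribution, the delimiter is treated as a single token, and the sampling step is a combination of top-k and nucleus (top-p) filtering. All names (`gts_decode`, `top_k_top_p_filter`, `toy_model`) are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k=40, top_p=0.9):
    """Keep the top_k most probable tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def gts_decode(next_token_probs, delimiter="style:", eos="</s>",
               max_len=50, top_k=40, top_p=0.9, rng=None):
    """Greedy search up to and including the delimiter, then sampling."""
    rng = rng or random.Random()
    tokens, sampling = [], False
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        if sampling:
            filtered = top_k_top_p_filter(probs, top_k, top_p)
            token = rng.choices(list(filtered), weights=list(filtered.values()))[0]
        else:
            token = max(probs, key=probs.get)  # most probable token
        if token == eos:
            break
        tokens.append(token)
        if token == delimiter:
            sampling = True  # the factor phrase is complete
    return tokens


def toy_model(prefix):
    """Toy stand-in for the captioning model (assumption for this example)."""
    if not prefix:
        return {"male": 0.6, "female": 0.4}
    if prefix[-1] in ("male", "female"):
        return {"style:": 1.0}
    if prefix[-1] == "style:":
        return {"A": 0.5, "The": 0.5}
    if prefix[-1] in ("A", "The"):
        return {"man": 0.7, "guy": 0.3}
    return {"</s>": 1.0}


print(gts_decode(toy_model, rng=random.Random(0)))
```

In this toy setting the factor part is always "male" followed by the delimiter, whereas the caption part varies with the random seed, which mirrors the intended behavior of GtS.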

4 Experiments

4.1 Setup

The PromptSpeech dataset provided by PromptTTS [1] was used for our evaluations. The audio is from the English Audiobook corpus, LibriTTS [28], and the labels include annotated speaking-style captions, as well as discrete speaking-style factors for gender (male/female), pitch (high/normal/low), volume (high/normal/low), and speaking speed (fast/normal/slow). The dataset was split into the official speaker-open subsets as in the conventional work of [11]. The training set consists of 24,953 utterances by 1,113 speakers, the development set of 857 utterances by 40 speakers, and the test set of 778 utterances by 38 speakers. The distributions of each class of the style factors were roughly equal. Note that from the total of 26,588 utterances, there were 154 unique captions.

The speech encoder was Whisper large-v3 [29] (denoted as Whisper-L), consisting of 32 intermediate layers with 1,280 units each; the input was fixed to 30 seconds, i.e., 1,500 frames. For the TLTR in the bridge network (TLTR-utt), initial pooling of the intermediate layer outputs was performed every 20 frames, and both the time-wise and layer-wise Transformers were single-layer Transformers with 512 hidden units. The text decoder was LLaMA-2 7B-chat [26], and the LoRA was the attention-based method [27] with $r$=8 and $\alpha$=32 (7B-chat-LoRA). The numbers of trainable/total parameters in each component were 0M/637M in the speech encoder, 90.7M/90.7M in the bridge network, and 4.2M/6.7B in the text decoder. The conventional method was StyleCap [11], consisting of a WavLM-base-plus [23] speech encoder, a weighted-sum + BLSTM + Q-Former [30]-based bridge network, and a LLaMA-2-7B decoder without LoRA. The batch size was 16. The optimizer was AdamW [31] with a learning rate of $1 \times 10^{-4}$. The number of epochs was set to 5, with early stopping based on the development set. Note that we did not pre-train the bridge network using the close-ended style classification tasks [12] (footnote 2). The baseline methods were the models trained on the original captions with greedy or sampling-based decoding. Two methods of creating FCC were evaluated: one in which the factor phrases were created from the labels predicted by GPT (denoted as w/ predicted factors), and one in which they were created from the discrete speaking-style factor labels provided in the dataset (w/ golden factors). The gpt-4-0613 model with the same prompt as described in Section 3.2 was used for the former. We evaluated the proposed FCC with greedy, sampling, and GtS decoding. In the inference step, the sampling thresholds on the probability and on the number of candidates were set to 0.9 and 40, respectively, for both sampling and GtS decoding.

Footnote 2: Our preliminary experiments found there were no statistically significant improvements with bridge-network pre-training.
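The reported 4.2M trainable parameters in the text decoder can be checked with a short calculation. The sketch below assumes that LoRA is applied to two attention projections per layer (for example the query and value projections, a common configuration that is not stated explicitly in the text) and uses the LLaMA-2 7B dimensions of 32 layers and a hidden size of 4,096. The function name is hypothetical.

```python
def lora_param_count(hidden_size, rank, num_layers, modules_per_layer):
    """Count LoRA parameters for square (hidden x hidden) projections.

    Each adapted projection adds two low-rank matrices:
    A with shape (rank, hidden) and B with shape (hidden, rank).
    """
    per_module = 2 * rank * hidden_size
    return per_module * modules_per_layer * num_layers


# Assumed setup: LLaMA-2 7B (32 layers, hidden size 4096), r = 8,
# and two adapted attention projections per layer.
count = lora_param_count(hidden_size=4096, rank=8, num_layers=32, modules_per_layer=2)
print(count)  # 4194304, i.e., about 4.2M
```

Under these assumptions the count agrees with the 4.2M figure given above; a different choice of adapted modules would give a different number.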

The evaluation tasks included not only speaking-style captioning but also style factor classification. The style factor classification involved using GPT to estimate factor classes for gender, pitch, volume, and speaking rate from captions, and then evaluating the accuracy against the annotated discrete labels. For the mapping from a caption to the four speaking-style factors, the predicted caption was given to the gpt-4-0613 model with the same prompt as described in Section 3.2; the output sentence was then parsed, and several non-target terms such as ‘unspecified’ were mapped to ‘normal’ in post-processing. Note that we created a codebook of GPT input/output pairs so that the same factor prediction results were returned for the same captions in all the experiments, which avoids the effect of sampling-based decoding in GPT. The evaluation metrics for speaking-style captioning were BLEU-4 (B@4), ROUGE-L (ROU), METEOR (MET), BERTScore with distilbert-base-uncased [32] (BS), and distinct-1 and 2 (dis-1/2), as a subset of those in the baseline study [11]. The first four metrics were computed with Hugging Face evaluate [33].
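For reference, distinct-n is commonly computed as the number of unique n-grams divided by the total number of n-grams in the generated captions. The sketch below shows one corpus-level variant with lowercasing and whitespace tokenization; these choices and the function name are assumptions, since the exact implementation is not specified here.

```python
def distinct_n(captions, n):
    """Corpus-level distinct-n: unique n-grams / total n-grams."""
    total = 0
    unique = set()
    for caption in captions:
        tokens = caption.lower().split()  # simple whitespace tokenization (assumption)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


captions = ["a man speaks slowly", "a man speaks quickly"]
print(distinct_n(captions, 1))  # 5 unique unigrams out of 8
print(distinct_n(captions, 2))  # 4 unique bigrams out of 6
```

Higher values indicate less repetition across the generated captions, which is why sampling-based decoding tends to raise this metric.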

4.2 Results

4.2.1 Main results

Table 1: Performance of speaking-style captioning and accuracy (%) of style factor classification from captions using GPT. The baselines are our model, trained by original captions with greedy or sampling-based decoding. The best results are highlighted in bold. Those metrics calculated using the ground-truth captions are also shown at the bottom for reference. "Ours" denotes the model with Whisper-L, TLTR-utt, and 7B-chat-LoRA.

\begin{tabular}{lllrrrrrrrrrrr}
\hline
 & & & \multicolumn{6}{c}{Speaking-style captioning} & \multicolumn{5}{c}{Style factor classification} \\
Model & Output & Decoding & B@4$\uparrow$ & ROU$\uparrow$ & MET$\uparrow$ & BS$\uparrow$ & dis-1$\uparrow$ & dis-2$\uparrow$ & Gender & Pitch & Speed & Volume & Avg. \\
\hline
StyleCap [11] & Caption & Greedy & 0.273 & 0.497 & 0.469 & 0.855 & 0.023 & 0.073 & 91.0 & 61.0 & 85.2 & 69.9 & 76.8 \\
Ours & Caption & Greedy & 0.613 & 0.710 & 0.698 & 0.925 & 0.017 & 0.051 & 97.9 & 91.4 & 93.3 & 97.9 & 95.1 \\
Ours & Caption & Sampling & 0.594 & 0.691 & 0.679 & 0.920 & \textbf{0.020} & 0.069 & 97.8 & 88.7 & 91.7 & 98.6 & 94.2 \\
Ours & FCC (w/ predicted factors) & Greedy & 0.638* & 0.726 & 0.714* & 0.930* & 0.016 & 0.047 & 98.3 & 92.2 & 94.2 & 98.5 & 95.8 \\
Ours & FCC (w/ predicted factors) & Sampling & 0.607 & 0.694 & 0.687 & 0.922 & \textbf{0.020} & \textbf{0.070} & 98.7 & 89.9 & \textbf{94.5} & 98.6 & 95.4 \\
Ours & FCC (w/ predicted factors) & GtS & \textbf{0.640}+ & 0.727+ & 0.716+ & 0.929+ & \textbf{0.020} & 0.070 & \textbf{99.0} & \textbf{93.2}+ & \textbf{94.5}+ & \textbf{99.2} & \textbf{96.5} \\
Ours & FCC (w/ golden factors) & Greedy & 0.638* & \textbf{0.727}* & \textbf{0.716}* & \textbf{0.931}* & 0.016 & 0.048 & 98.3 & 92.3 & 94.2 & 98.5 & 95.8 \\
Ours & FCC (w/ golden factors) & Sampling & 0.607 & 0.693 & 0.687 & 0.922 & \textbf{0.020} & \textbf{0.070} & 98.7 & 89.9 & \textbf{94.5} & 98.6 & 95.4 \\
Ours & FCC (w/ golden factors) & GtS & \textbf{0.640}+ & 0.727+ & 0.716+ & 0.929+ & \textbf{0.020} & 0.067 & \textbf{99.0} & \textbf{93.2}+ & \textbf{94.5}+ & \textbf{99.2} & \textbf{96.5} \\
\hline
Ground-truth captions & & & 1.000 & 1.000 & 1.000 & 1.000 & 0.020 & 0.071 & 99.1 & 99.1 & 99.6 & 99.4 & 99.3 \\
\hline
\end{tabular}

  • */+: Statistically significant improvement from the model trained using the original captions with greedy search (*) / sampling (+).

  • †/‡: Statistically significant improvement from greedy search with the model trained using the original captions (†) / FCC (‡).

The results on speaking-style captioning and style factor classification are listed in Table 1. First, as shown in the bottom row, the accuracies of factor class predictions from ground-truth captions using GPT were more than 99% for all the factors. This suggests that GPT can extract speaking-style factors from captions almost perfectly. Comparing the captioning models in the first and second rows, our model significantly outperformed the conventional model in all tasks (footnote 3). As discussed in the following sections, each component of our model improved on its counterpart in the conventional model. With the same model structure and greedy search, FCC with either predicted or golden factors showed improvements over training with the original captions in BLEU-4, ROUGE-L, METEOR, and BERTScore. These differences were statistically significant according to the bootstrap-based test ($p$<.05, $n$=1000) except for FCC with predicted factors in ROUGE-L. The performance of style factor classification also improved, but the differences were not significant. These results indicate that both the conventional captioning system and FCC can learn information about speaking-style factors from input speech, but FCC yields better captions. One possible reason is that the factor prediction in the first step helps the model focus on solving syntactic tasks in the caption generation step.

Comparing greedy and sampling-based search on the caption-based and FCC-based models, the sampling search yielded improved output diversity as shown by distinct-1/2, but degraded both captioning and factor classification accuracies. On the other hand, FCC with GtS showed no statistically significant degradation in either captioning or factor classification metrics from greedy search, while significantly improving caption diversity as shown by distinct-1/2 ($p$<.05, $n$=1000). These results show that the proposed FCC with GtS generates accurate and diverse captions. Note that FCC with sampling-based search exhibited significant degradations on both tasks, which confirms that the sampling-based prediction of style factors is inappropriate. Finally, it is worth mentioning that in all metrics, there is no statistically significant degradation between the proposed FCC with predicted factors and that with golden factors.

Footnote 3: The StyleCap results were with sentence rephrasing augmentation, while ours were not. Our preliminary experiments found that rephrasing yielded no significant improvement in most setups.
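The details of the bootstrap-based test are not given here, so the following is only a generic sketch of a one-sided paired bootstrap test with $n$=1000 resamples. It assumes that a per-utterance score is available for both systems and that the compared statistic is the mean score; corpus-level metrics such as BLEU would instead need to be recomputed on each resampled set. The function name is hypothetical.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A does not beat system B.

    scores_a and scores_b are per-utterance scores of the two systems
    on the same test utterances (paired by index).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired")
    rng = random.Random(seed)
    size = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample utterance indices with replacement.
        indices = [rng.randrange(size) for _ in range(size)]
        diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


p = paired_bootstrap_p_value([0.7] * 50, [0.6] * 50)
print(p < 0.05)
```

A small p-value means that system A outperforms system B in almost all resampled test sets, which is the sense in which an improvement is called significant at $p$<.05.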

Table 2: Comparison of the speech encoders. The component used in StyleCap [11] is underlined.

\begin{tabular}{lrrrrrr}
\hline
 & \multicolumn{2}{c}{Style captioning} & \multicolumn{4}{c}{Style-factor classification} \\
 & B@4$\uparrow$ & ROU$\uparrow$ & Gender & Pitch & Speed & Volume \\
\hline
WavLM-B-plus & \underline{0.419} & \underline{0.513} & \underline{97.8} & \underline{77.6} & \underline{92.4} & \underline{76.2} \\
HuBERT-L & 0.517 & 0.626 & 98.2 & 80.1 & 87.5 & \textbf{99.0} \\
WavLM-L & 0.597 & 0.693 & \textbf{99.0} & 88.4 & 92.5 & 97.8 \\
Whisper-L & \textbf{0.613} & \textbf{0.710} & 97.9 & \textbf{91.4} & \textbf{93.3} & 97.9 \\
\hline
\end{tabular}
Table 3: Comparison of the bridge networks.

\begin{tabular}{lrrrrrr}
\hline
 & \multicolumn{2}{c}{Style captioning} & \multicolumn{4}{c}{Style-factor classification} \\
 & B@4$\uparrow$ & ROU$\uparrow$ & Gender & Pitch & Speed & Volume \\
\hline
W.sum + AvP & 0.505 & 0.615 & 98.1 & 85.7 & 81.4 & 95.5 \\
W.sum + CNN & 0.383 & 0.501 & 98.1 & 74.3 & 79.4 & 88.6 \\
W.sum + QF & \underline{0.561} & \underline{0.668} & \underline{98.6} & \underline{88.3} & \underline{91.7} & \underline{97.0} \\
TLTR-seg & 0.578 & 0.666 & 97.8 & 88.4 & 89.9 & \textbf{98.2} \\
TLTR-utt & \textbf{0.613} & \textbf{0.710} & 97.9 & \textbf{91.4} & \textbf{93.3} & 97.9 \\
\hline
\end{tabular}
Table 4: Comparison of the text decoders.

\begin{tabular}{llrrrrrr}
\hline
 & & \multicolumn{2}{c}{Style captioning} & \multicolumn{4}{c}{Style-factor classification} \\
Base & LoRA & B@4$\uparrow$ & ROU$\uparrow$ & Gender & Pitch & Speed & Volume \\
\hline
7B & no & \underline{0.578} & \underline{0.684} & \underline{97.7} & \underline{89.0} & \underline{93.6} & \underline{98.6} \\
7B & yes & 0.602 & 0.695 & 98.1 & \textbf{91.5} & 92.8 & 94.6 \\
7B-chat & no & 0.575 & 0.675 & \textbf{98.8} & 88.3 & 92.9 & 96.5 \\
7B-chat & yes & \textbf{0.613} & \textbf{0.710} & 97.9 & 91.4 & 93.3 & 97.9 \\
\hline
\end{tabular}

4.2.2 Ablation study: Encoder, Bridge network, and Decoder

To investigate the best components of the speaking-style captioning model, we compared several widely used modules in each component. The base structure uses Whisper-L as the speech encoder, TLTR-utt as the bridge network, and 7B-chat-LoRA as the text decoder. Each model was trained to output the original speaking-style captions and greedy search was used for decoding, which corresponds to the second row in Table 1. The training hyperparameters were the same as those in the main experiment. Due to space limitations, we report here only BLEU-4, ROUGE-L, and four classification accuracies.

Comparisons of the speech encoders with WavLM-base-plus [23] (WavLM-B-plus), WavLM-large (WavLM-L), and HuBERT-large [24] (HuBERT-L) are detailed in Table 2. WavLM-B-plus showed lower accuracy than the other encoders in pitch and volume prediction, as in the previous work [11]. Among the remaining three encoders, Whisper-L achieved the best performance in both captioning and factor classification tasks. This indicates that style captioning performance is highly dependent on the capability of the speech encoder, and that well-trained large-scale models provide better results.

Various bridge networks are also compared in Table 3: the weighted sum of the intermediate layer outputs followed by average pooling (W.sum + AvP), CNN (W.sum + CNN), or BLSTM + Q-Former (W.sum + QF), and TLTR with segment-level outputs (TLTR-seg, which removes the temporal pooling as in LTU-AS [12]). The results indicate that capturing temporal context in the bridge is effective (rows 1 and 2 vs. rows 3 to 5), and that outputting a single-length embedding is more suitable than sequential embeddings in speaking-style captioning and style factor classification (row 4 vs. row 5).

Comparisons of the text decoders, LLaMA-2 7B or 7B-chat with and without LoRA, are shown in Table 4. LoRA improved speaking-style captioning and pitch classification, but the performance difference between text decoders is smaller than that of the speech encoders and bridge networks. This is because the variation in style captions was limited in this dataset, and so did not require a strong text generation module.

5 Conclusion

This paper presented FCC, a novel speaking-style captioning method. FCC trains the model to output a factor phrase followed by the caption, which supports the generation of captions conditioned on the predicted speaking-style factors. GtS (greedy-then-sampling) decoding first determines the speaking-style factors by a maximum likelihood criterion and then generates captions by sampling, which yields diverse outputs while preventing factor prediction errors. We also investigated the optimal components of the speech-text model. Experiments showed that our method significantly outperformed the conventional style captioning model, and that FCC with GtS can generate accurate and diverse style captions while identifying style factors.

References

  • [1] Z. Guo, Y. Leng, Y. Wu, S. Zhao, and X. Tan, “PromptTTS: Controllable text-to-speech with text descriptions,” in Proc. ICASSP, 2023.
  • [2] R. Shimizu, R. Yamamoto, M. Kawamura, Y. Shirahata, H. Doi, T. Komatsu, and K. Tachibana, “PromptTTS++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions,” in Proc. ICASSP (to appear), 2024.
  • [3] S. Ji, J. Zuo, M. Fang, Z. Jiang, F. Chen, X. Duan, B. Huai, and Z. Zhao, “TextrolSpeech: A text style control speech corpus with codec language text-to-speech models,” in Proc. ICASSP (to appear), 2024.
  • [4] J. Yao, Y. Yang, Y. Lei, Z. Ning, Y. Hu, Y. Pan, J. Yin, H. Zhou, H. Lu, and L. Xie, “PromptVC: Flexible stylistic voice conversion in latent space driven by natural language prompts,” in Proc. ICASSP (to appear), 2024.
  • [5] J. Miehle, I. Feustel, J. Hornauer, W. Minker, and S. Ultes, “Estimating user communication styles for spoken dialogue systems,” in Proc. LREC, 2020, pp. 540–548.
  • [6] G.-T. Lin, C.-H. Chiang, and H.-y. Lee, “Advancing large language models to capture varied speaking styles and respond properly in spoken conversations,” 2024, arXiv:2402.12786.
  • [7] H. Zhang, M. Mimura, T. Kawahara, and K. Ishizuka, “Selective multi-task learning for speech emotion recognition using corpora of different styles,” in Proc. ICASSP, 2022, pp. 7707–7711.
  • [8] M. Sharma, “Multi-lingual multi-task speech emotion recognition using wav2vec 2.0,” in Proc. ICASSP, 2022, pp. 6907–6911.
  • [9] T. Shinozaki, M. Ostendorf, and L. Atlas, “Characteristics of speaking style and implications for speech recognition,” Journal of Acoust. Soc. Am., vol. 126, no. 3, pp. 1500–1510, 2009.
  • [10] A. Veiga, S. Candeias, D. Celorico, J. Proença, and F. Perdigão, “Towards automatic classification of speech styles,” in Proc. Computational Processing of the Portuguese Language, 2012, pp. 421–426.
  • [11] K. Yamauchi, Y. Ijima, and Y. Saito, “StyleCap: Automatic speaking-style captioning from speech based on speech and language self-supervised learning models,” in Proc. ICASSP (to appear), 2024.
  • [12] Y. Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass, “Joint audio and speech understanding,” in Proc. ASRU, 2023.
  • [13] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” 2023, arXiv:2311.07919.
  • [14] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in Proc. ICLR, 2020.
  • [15] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in NeurIPS, vol. 35, 2022, pp. 24 824–24 837.
  • [16] OpenAI, “GPT-4 Technical Report,” 2023, arXiv:2303.08774.
  • [17] K.-W. Chang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, “An exploration of prompt tuning on generative spoken language model for speech processing tasks,” in Proc. Interspeech, 2022, pp. 5005–5009.
  • [18] S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” in Advances in NeurIPS, vol. 36, 2023, pp. 18 090–18 108.
  • [19] M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen, Y. Zhang, H. Soltau, P. K. Rubenstein, L. Zilka, D. Yu, G. Pundak, N. Siddhartha, J. Schalkwyk, and Y. Wu, “SLM: Bridge the thin gap between speech and text foundation models,” in Proc. ASRU, 2023.
  • [20] T. Wang, L. Zhou, Z. Zhang, Y. Wu, S. Liu, Y. Gaur, Z. Chen, J. Li, and F. Wei, “VioLA: Unified codec language models for speech recognition, synthesis, and translation,” 2023, arXiv:2305.16107.
  • [21] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy et al., “AudioPaLM: A large language model that can speak and listen,” 2023, arXiv:2306.12925.
  • [22] Y. Xu, H. Chen, J. Yu, Q. Huang, Z. Wu, S. Zhang, G. Li, Y. Luo, and R. Gu, “SECap: Speech emotion captioning with large language model,” in Proc. AAAI (to appear), 2024.
  • [23] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [24] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [25] Y. Gong, S. Khurana, L. Karlinsky, and J. Glass, “Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers,” in Proc. Interspeech, 2023, pp. 2798–2802.
  • [26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023, arXiv:2307.09288.
  • [27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2022.
  • [28] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech, 2019, pp. 1526–1530.
  • [29] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023, pp. 28 492–28 518.
  • [30] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. ICML, 2023.
  • [31] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.
  • [32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in Proc. NeurIPS EMCC Workshop, 2019.
  • [33] L. Von Werra, L. Tunstall, A. Thakur, S. Luccioni, T. Thrush, A. Piktus, F. Marty, N. Rajani, V. Mustar, and H. Ngo, “Evaluate & evaluation on the hub: Better best practices for data and model measurements,” in Proc. EMNLP, 2022, pp. 128–136.