-
Speech Translation with Large Language Models: An Industrial Practice
Authors:
Zhichao Huang,
Rong Ye,
Tom Ko,
Qianqian Dong,
Shanbo Cheng,
Mingxuan Wang,
Hang Li
Abstract:
Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long au…
▽ More
Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that the implementation of Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/llm-st/.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
RepCodec: A Speech Representation Codec for Speech Tokenization
Authors:
Zhichao Huang,
Chutong Meng,
Tom Ko
Abstract:
With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech token…
▽ More
With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech tokenization. In contrast to audio codecs which reconstruct the raw audio, RepCodec learns a vector quantization codebook through reconstructing speech representations from speech encoders like HuBERT or data2vec. Together, the speech encoder, the codec encoder and the vector quantization codebook form a pipeline for converting speech waveforms into semantic tokens. The extensive experiments illustrate that RepCodec, by virtue of its enhanced information retention capacity, significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Furthermore, this superiority extends across various speech encoders and languages, affirming the robustness of RepCodec. We believe our method can facilitate large language modeling research on speech processing.
△ Less
Submitted 6 June, 2024; v1 submitted 31 August, 2023;
originally announced September 2023.
-
Recent Advances in Direct Speech-to-text Translation
Authors:
Chen Xu,
Rong Ye,
Qianqian Dong,
Chengqi Zhao,
Tom Ko,
Mingxuan Wang,
Tong Xiao,
**gbo Zhu
Abstract:
Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and applicati…
▽ More
Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and application issues. To tackle the problem of modeling burden, two main structures have been proposed, encoder-decoder framework (Transformer and the variants) and multitask frameworks. For the challenge of data scarcity, recent work resorts to many sophisticated techniques, such as data augmentation, pre-training, knowledge distillation, and multilingual modeling. We analyze and summarize the application issues, which include real-time, segmentation, named entity, gender bias, and code-switching. Finally, we discuss some promising directions for future work.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
MOSPC: MOS Prediction Based on Pairwise Comparison
Authors:
Kexin Wang,
Yunlong Zhao,
Qianqian Dong,
Tom Ko,
Mingxuan Wang
Abstract:
As a subjective metric to evaluate the quality of synthesized speech, Mean opinion score~(MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech…
▽ More
As a subjective metric to evaluate the quality of synthesized speech, Mean opinion score~(MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech when the MOS scores are close. However, in practical applications, it is more important to correctly rank the quality of synthesis systems or sentences than simply predicting MOS scores. Meanwhile, as each annotator scores multiple audios during annotation, the score is probably a relative value based on the first or the first few speech scores given by the annotator. Motivated by the above two points, we propose a general framework for MOS prediction based on pair comparison (MOSPC), and we utilize C-Mixup algorithm to enhance the generalization performance of MOSPC. The experiments on BVCC and VCC2018 show that our framework outperforms the baselines on most of the correlation coefficient metrics, especially on the metric KTAU related to quality ranking. And our framework also surpasses the strong baseline in ranking accuracy on each fine-grained segment. These results indicate that our framework contributes to improving the ranking accuracy of speech quality.
△ Less
Submitted 18 June, 2023;
originally announced June 2023.
-
PolyVoice: Language Models for Speech to Speech Translation
Authors:
Qianqian Dong,
Zhiying Huang,
Qiao Tian,
Chen Xu,
Tom Ko,
Yunlong Zhao,
Siyuan Feng,
Tang Li,
Kexin Wang,
Xuxin Cheng,
Fengpeng Yue,
Ye Bai,
Xi Chen,
Lu Lu,
Zejun Ma,
Yu** Wang,
Mingxuan Wang,
Yuxuan Wang
Abstract:
We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt…
▽ More
We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
△ Less
Submitted 13 June, 2023; v1 submitted 5 June, 2023;
originally announced June 2023.
-
DUB: Discrete Unit Back-translation for Speech Translation
Authors:
Dong Zhang,
Rong Ye,
Tom Ko,
Mingxuan Wang,
Yaqian Zhou
Abstract:
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to a…
▽ More
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to answer two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, the back-translation technique can successfully be applied on direct ST and obtains an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves comparable performance to existing methods that rely on large-scale external data. Code and models are available at https://github.com/0nutation/DUB.
△ Less
Submitted 18 May, 2023;
originally announced May 2023.
-
Model Predictive Control of Smart Districts Participating in Frequency Regulation Market: A Case Study of Using Heating Network Storage
Authors:
Hikaru Hoshino,
T. John Koo,
Yun-Chung Chu,
Yoshihiko Susuki
Abstract:
Flexibility provided by Combined Heat and Power (CHP) units in district heating networks is an important means to cope with increasing penetration of intermittent renewable energy resources, and various methods have been proposed to exploit thermal storage tanks installed in these networks. This paper studies a novel problem motivated by an example of district heating and cooling networks in Japan…
▽ More
Flexibility provided by Combined Heat and Power (CHP) units in district heating networks is an important means to cope with increasing penetration of intermittent renewable energy resources, and various methods have been proposed to exploit thermal storage tanks installed in these networks. This paper studies a novel problem motivated by an example of district heating and cooling networks in Japan, where high-temperature steam is used as the heating medium. In steam-based networks, storage tanks are usually absent, and there is a strong need to utilize thermal inertia of the pipeline network as storage. However, this type of use of a heating network directly affects the operating condition of the network, and assuring safety and supply quality at the use side is an open problem. To address this, we formulate a novel control problem to utilize CHP units in frequency regulation market while satisfying physical constraints on a steam network described by a nonlinear model capturing dynamics of heat flows and heat accumulation in the network. Furthermore, a Model Predictive Control (MPC) framework is proposed to solve this problem. By consistently combining several nonlinear control techniques, a computationally efficient MPC controller is obtained and shown to work in real-time.
△ Less
Submitted 11 May, 2023;
originally announced May 2023.
-
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Authors:
Xinhao Mei,
Chutong Meng,
Haohe Liu,
Qiuqiang Kong,
Tom Ko,
Chengqi Zhao,
Mark D. Plumbley,
Yuexian Zou,
Wenwu Wang
Abstract:
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…
▽ More
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.
△ Less
Submitted 30 March, 2023;
originally announced March 2023.
-
M3ST: Mix at Three Levels for Speech Translation
Authors:
Xuxin Cheng,
Qianqian Dong,
Fengpeng Yue,
Tom Ko,
Mingxuan Wang,
Yuexian Zou
Abstract:
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine…
▽ More
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. At the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
△ Less
Submitted 7 December, 2022;
originally announced December 2022.
-
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Authors:
Xubo Liu,
Qiushi Huang,
Xinhao Mei,
Haohe Liu,
Qiuqiang Kong,
Jianyuan Sun,
Shengchen Li,
Tom Ko,
Yu Zhang,
Lilian H. Tang,
Mark D. Plumbley,
Volkan Kılıç,
Wenwu Wang
Abstract:
Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…
▽ More
Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes the redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
△ Less
Submitted 28 May, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning
Authors:
Chutong Meng,
Junyi Ao,
Tom Ko,
Mingxuan Wang,
Haizhou Li
Abstract:
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech…
▽ More
Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech input. Unlike the prior self-distillation approaches of which the teacher and the student are of the same modality, our target model predicts representations from a different modality. CoBERT outperforms the most recent state-of-the-art performance on the ASR task and brings significant improvements on the SUPERB speech translation (ST) task. Our code and models are released at https://github.com/mct10/CoBERT.
△ Less
Submitted 5 July, 2023; v1 submitted 8 October, 2022;
originally announced October 2022.
-
A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
Authors:
Qibing Bai,
Tom Ko,
Yu Zhang
Abstract:
In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of…
▽ More
In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of semantic information. Though it has become more common to complement the systems with extra language models, their performance in modeling rising intonation is not well studied. In this paper, we propose to complement the Cantonese TTS model with a BERT-based statement/question classifier. We design different training strategies and compare their performance. We conduct our experiments on a Cantonese corpus named CanTTS. Empirical results show that the separate training approach obtains the best generalization performance and feasibility.
△ Less
Submitted 3 August, 2022;
originally announced August 2022.
-
Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation
Authors:
Qianqian Dong,
Fengpeng Yue,
Tom Ko,
Mingxuan Wang,
Qibing Bai,
Yu Zhang
Abstract:
Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech map**. In this paper, we report our recent achievements in S2ST. Firstly, we build a S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new…
▽ More
Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech map**. In this paper, we report our recent achievements in S2ST. Firstly, we build a S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new state-of-the-art result on the Fisher English-to-Spanish test set. Indeed, we exploit the pseudo data with a combination of popular techniques which are not trivial when applied to S2ST. Moreover, we evaluate our approach on both syntactically similar (Spanish-English) and distant (English-Chinese) language pairs. Our implementation is available at https://github.com/fengpeng-yue/speech-to-speech-translation.
△ Less
Submitted 18 May, 2022;
originally announced May 2022.
-
GigaST: A 10,000-hour Pseudo Speech Translation Corpus
Authors:
Rong Ye,
Chengqi Zhao,
Tom Ko,
Chutong Meng,
Tao Wang,
Mingxuan Wang,
Jun Cao
Abstract:
This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by human. ST models trained with an addition of our corpus obtain new state-of-the-art results on the MuST-C…
▽ More
This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by human. ST models trained with an addition of our corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set. We provide a detailed description of the translation process and verify its quality. We make the translated text data public and hope to facilitate research in speech translation. Additionally, we also release the training scripts on NeurST to make it easy to replicate our systems. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST.
△ Less
Submitted 6 June, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data
Authors:
Junyi Ao,
Ziqiang Zhang,
Long Zhou,
Shujie Liu,
Haizhou Li,
Tom Ko,
Lirong Dai,
**yu Li,
Yao Qian,
Furu Wei
Abstract:
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language mode…
▽ More
This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language modeling in encoder output, like HuBERT model, while the other lets the decoder learn to reconstruct pseudo codes autoregressively instead of generating textual scripts. In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training, and also outperforms significantly the state-of-the-art wav2vec 2.0 and HuBERT on fine-tuning subsets of 10h and 100h. We release our code and model at https://github.com/microsoft/SpeechT5/tree/main/Speech2C.
△ Less
Submitted 20 June, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
Authors:
Rui Wang,
Qibing Bai,
Junyi Ao,
Long Zhou,
Zhixiang Xiong,
Zhihua Wei,
Yu Zhang,
Tom Ko,
Haizhou Li
Abstract:
Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pr…
▽ More
Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet that is nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations from HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures concerning the embedding dimension, attention dimension, head number, feed-forward network ratio, and network depth. LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks with the HuBERT size, achieves comparable performance to the teacher model in most tasks with a reduction of 29% parameters, and obtains a $3.5\times$ compression ratio in three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss. The code and pre-trained models are available at https://github.com/mechanicalsea/lighthubert.
△ Less
Submitted 18 June, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Authors:
Junyi Ao,
Rui Wang,
Long Zhou,
Chengyi Wang,
Shuo Ren,
Yu Wu,
Shujie Liu,
Tom Ko,
Qing Li,
Yu Zhang,
Zhihua Wei,
Yao Qian,
**yu Li,
Furu Wei
Abstract:
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After prepro…
▽ More
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, ho** to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.
△ Less
Submitted 24 May, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Multi-View Self-Attention Based Transformer for Speaker Recognition
Authors:
Rui Wang,
Junyi Ao,
Long Zhou,
Shujie Liu,
Zhihua Wei,
Tom Ko,
Qing Li,
Yu Zhang
Abstract:
Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Tr…
▽ More
Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition. Specifically, to balance the capabilities of capturing global dependencies and modeling the locality, we propose a multi-view self-attention mechanism for speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism achieves improvement in the performance of speaker recognition, and the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
△ Less
Submitted 27 January, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning
Authors:
Xinhao Mei,
Qiushi Huang,
Xubo Liu,
Gengyun Chen,
**gqian Wu,
Yusong Wu,
**zheng Zhao,
Shengchen Li,
Tom Ko,
H Lilian Tang,
Xi Shao,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced t…
▽ More
Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced to mitigate the problem induced by data scarcity. Besides, evaluation metrics are incorporated into the optimization of the model with reinforcement learning, which helps address the problem of ``exposure bias'' induced by ``teacher forcing'' training strategy and the mismatch between the evaluation metrics and the loss function. The resulting system was ranked 3rd in DCASE 2021 Task 6. Ablation studies are carried out to investigate how much each element in the proposed system can contribute to final performance. The results show that the proposed techniques significantly improve the scores of the evaluation metrics, however, reinforcement learning may impact adversely on the quality of the generated captions.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
CL4AC: A Contrastive Loss for Audio Captioning
Authors:
Xubo Liu,
Qiushi Huang,
Xinhao Mei,
Tom Ko,
H Lilian Tang,
Mark D. Plumbley,
Wenwu Wang
Abstract:
Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded in…
▽ More
Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation, and aligned with its corresponding text descriptions, then a decoder is used to generate the captions. However, training of an AAC system often encounters the problem of data scarcity, which may lead to inaccurate representation and audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, the self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and texts by contrasting samples, which can improve the quality of latent representation and the alignment between audio and texts, while trained with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.
△ Less
Submitted 22 November, 2021; v1 submitted 21 July, 2021;
originally announced July 2021.
-
Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation
Authors:
Fengpeng Yue,
Yan Deng,
Lei He,
Tom Ko
Abstract:
Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text d…
▽ More
Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text data from target domain. We conduct experiments by adapting from audiobook domain (LibriSpeech) to presentation domain (TED-LIUM), there is a relative word error rate (WER) reduction of 10% for the E2E ASR model on the TED-LIUM test set, and a relative WER reduction of 51.5% in synthetic speech generated by neural TTS in the presentation domain. Further, we apply few-shot speaker adaptation for the E2E ASR by using a few utterances from target speakers in an unsupervised way, results in additional gains.
△ Less
Submitted 8 April, 2021;
originally announced April 2021.
-
Dynamic Security Analysis of Power Systems by a Sampling-Based Algorithm
Authors:
Qiang Wu,
T. John Koo,
Yoshihiko Susuki
Abstract:
Dynamic security analysis is an important problem of power systems on ensuring safe operation and stable power supply even when certain faults occur. No matter such faults are caused by vulnerabilities of system components, physical attacks, or cyber-attacks that are more related to cyber-security, they eventually affect the physical stability of a power system. Examples of the loss of physical st…
▽ More
Dynamic security analysis is an important problem of power systems on ensuring safe operation and stable power supply even when certain faults occur. No matter such faults are caused by vulnerabilities of system components, physical attacks, or cyber-attacks that are more related to cyber-security, they eventually affect the physical stability of a power system. Examples of the loss of physical stability include the Northeast blackout of 2003 in North America and the 2015 system-wide blackout in Ukraine. The nonlinear hybrid nature, that is, nonlinear continuous dynamics integrated with discrete switching, and the high degree of freedom property of power system dynamics make it challenging to conduct the dynamic security analysis. In this paper, we use the hybrid automaton model to describe the dynamics of a power system and mainly deal with the index-1 differential-algebraic equation models regarding the continuous dynamics in different discrete states. The analysis problem is formulated as a reachability problem of the associated hybrid model. A sampling-based algorithm is then proposed by integrating modeling and randomized simulation of the hybrid dynamics to search for a feasible execution connecting an initial state of the post-fault system and a target set in the desired operation mode. The proposed method enables the use of existing power system simulators for the synthesis of discrete switching and control strategies through randomized simulation. The effectiveness and performance of the proposed approach are demonstrated with an application to the dynamic security analysis of the New England 39-bus benchmark power system exhibiting hybrid dynamics. In addition to evaluating the dynamic security, the proposed method searches for a feasible strategy to ensure the dynamic security of the system in face of disruptions.
△ Less
Submitted 8 November, 2018;
originally announced November 2018.