Showing 1–22 of 22 results for author: Ko, T

Searching in archive eess.
  1. arXiv:2312.13585

    cs.CL cs.SD eess.AS

    Speech Translation with Large Language Models: An Industrial Practice

    Authors: Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

    Abstract: Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long au…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Technical report. 13 pages. Demo: https://speechtranslation.github.io/llm-st/

  2. arXiv:2309.00169

    eess.AS cs.LG cs.SD

    RepCodec: A Speech Representation Codec for Speech Tokenization

    Authors: Zhichao Huang, Chutong Meng, Tom Ko

    Abstract: With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech token…

    Submitted 6 June, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

  3. arXiv:2306.11646

    cs.CL eess.AS

    Recent Advances in Direct Speech-to-text Translation

    Authors: Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu

    Abstract: Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and applicati…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: An expanded version of the paper accepted by IJCAI2023 survey track

  4. arXiv:2306.10493

    cs.SD cs.CL eess.AS

    MOSPC: MOS Prediction Based on Pairwise Comparison

    Authors: Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

    Abstract: As a subjective metric to evaluate the quality of synthesized speech, Mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech…

    Submitted 18 June, 2023; originally announced June 2023.

  5. arXiv:2306.02982

    cs.CL eess.AS

    PolyVoice: Language Models for Speech to Speech Translation

    Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

    Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt…

    Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

  6. arXiv:2305.11411

    cs.CL cs.SD eess.AS

    DUB: Discrete Unit Back-translation for Speech Translation

    Authors: Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou

    Abstract: How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to a…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  7. arXiv:2305.07198

    eess.SY

    Model Predictive Control of Smart Districts Participating in Frequency Regulation Market: A Case Study of Using Heating Network Storage

    Authors: Hikaru Hoshino, T. John Koo, Yun-Chung Chu, Yoshihiko Susuki

    Abstract: Flexibility provided by Combined Heat and Power (CHP) units in district heating networks is an important means to cope with increasing penetration of intermittent renewable energy resources, and various methods have been proposed to exploit thermal storage tanks installed in these networks. This paper studies a novel problem motivated by an example of district heating and cooling networks in Japan…

    Submitted 11 May, 2023; originally announced May 2023.

  8. arXiv:2303.17395

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  9. arXiv:2212.03657

    cs.CL cs.SD eess.AS

    M3ST: Mix at Three Levels for Speech Translation

    Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

    Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2210.16428

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  11. arXiv:2210.04062

    cs.SD eess.AS

    CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

    Authors: Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

    Abstract: Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech…

    Submitted 5 July, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  12. arXiv:2208.02189

    eess.AS cs.CL cs.LG cs.SD

    A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

    Authors: Qibing Bai, Tom Ko, Yu Zhang

    Abstract: In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted by INTERSPEECH 2022

  13. arXiv:2205.08993

    cs.CL eess.AS

    Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

    Authors: Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang

    Abstract: Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. Firstly, we build a S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  14. arXiv:2204.03939

    cs.CL cs.SD eess.AS

    GigaST: A 10,000-hour Pseudo Speech Translation Corpus

    Authors: Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao

    Abstract: This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by human. ST models trained with an addition of our corpus obtain new state-of-the-art results on the MuST-C…

    Submitted 6 June, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted at Interspeech 2023. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST

  15. arXiv:2203.17113

    cs.SD cs.LG eess.AS

    Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

    Authors: Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

    Abstract: This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language mode…

    Submitted 20 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted by Interspeech 2022

  16. arXiv:2203.15610

    eess.AS cs.CL cs.LG cs.SD

    LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

    Authors: Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

    Abstract: Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pr…

    Submitted 18 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, accepted to Interspeech 2022

  17. arXiv:2110.07205

    eess.AS cs.CL cs.LG cs.SD

    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

    Authors: Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

    Abstract: Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After prepro…

    Submitted 24 May, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted by ACL 2022 main conference

  18. arXiv:2110.05036

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Multi-View Self-Attention Based Transformer for Speaker Recognition

    Authors: Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

    Abstract: Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Tr…

    Submitted 27 January, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Paper to appear at ICASSP 2022

  19. arXiv:2108.02752

    eess.AS cs.SD

    An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

    Authors: Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced t…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: 5 pages, 1 figure, submitted to DCASE 2021 workshop

  20. arXiv:2107.09990

    eess.AS cs.AI cs.SD

    CL4AC: A Contrastive Loss for Audio Captioning

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded in…

    Submitted 22 November, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: The first two authors contributed equally, 5 pages, 3 figures, accepted by DCASE2021 Workshop

  21. arXiv:2104.03815

    cs.CL cs.SD eess.AS

    Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

    Authors: Fengpeng Yue, Yan Deng, Lei He, Tom Ko

    Abstract: Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text d…

    Submitted 8 April, 2021; originally announced April 2021.

  22. arXiv:1811.03301

    eess.SY math.OC

    Dynamic Security Analysis of Power Systems by a Sampling-Based Algorithm

    Authors: Qiang Wu, T. John Koo, Yoshihiko Susuki

    Abstract: Dynamic security analysis is an important problem of power systems on ensuring safe operation and stable power supply even when certain faults occur. No matter such faults are caused by vulnerabilities of system components, physical attacks, or cyber-attacks that are more related to cyber-security, they eventually affect the physical stability of a power system. Examples of the loss of physical st…

    Submitted 8 November, 2018; originally announced November 2018.

    Comments: 23 pages, 12 figures

    Journal ref: ACM Transactions on Cyber-Physical Systems, Vol. 2, No. 2, Article 10, June 2018