Skip to main content

Showing 1–50 of 57 results for author: Wen, Z

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.14869  [pdf, other

    eess.SP

    Cost-Effective RF Fingerprinting Based on Hybrid CVNN-RF Classifier with Automated Multi-Dimensional Early-Exit Strategy

    Authors: Jiayan Gan, Zhixing Du, Qiang Li, Huaizong Shao, **gran Lin, Ye Pan, Zhongyi Wen, Shafei Wang

    Abstract: While the Internet of Things (IoT) technology is booming and offers huge opportunities for information exchange, it also faces unprecedented security challenges. As an important complement to the physical layer security technologies for IoT, radio frequency fingerprinting (RFF) is of great interest due to its difficulty in counterfeiting. Recently, many machine learning (ML)-based RFF algorithms h… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted by IEEE Internet of Things Journal

  2. arXiv:2406.10591  [pdf, other

    eess.AS cs.AI cs.CV cs.MM cs.SD

    MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

    Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

    Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  3. arXiv:2406.08112  [pdf, other

    cs.SD cs.AI eess.AS

    Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

    Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

    Abstract: With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880

  4. arXiv:2406.04840  [pdf, other

    cs.SD eess.AS

    TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

    Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

    Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: acceped by interspeech 2024

  5. arXiv:2406.04683  [pdf, other

    cs.SD eess.AS

    PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

    Authors: Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

    Abstract: Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge abo… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: accepted by INTERSPEECH2024

  6. arXiv:2406.03247  [pdf, other

    cs.SD eess.AS

    Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

    Authors: Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

    Abstract: The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework guided, aiming for highly generalized FAD, called GFL-FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation… ▽ More

    Submitted 9 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  7. arXiv:2406.03240  [pdf, other

    cs.SD cs.AI eess.AS

    Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

    Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao

    Abstract: With the proliferation of deepfake audio, there is an urgent need to investigate their attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose Real Emphasis an… ▽ More

    Submitted 8 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  8. arXiv:2406.03237  [pdf, other

    cs.SD eess.AS

    Generalized Fake Audio Detection via Deep Stable Learning

    Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi

    Abstract: Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate t… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: accepted by INTERSPEECH2024

  9. arXiv:2405.04880  [pdf, other

    cs.SD cs.AI eess.AS

    The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

    Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

    Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on… ▽ More

    Submitted 15 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  10. arXiv:2312.16998  [pdf, other

    eess.IV cs.CV

    Deep Unfolding Network with Spatial Alignment for multi-modal MRI reconstruction

    Authors: Hao Zhang, Qi Wang, Jun Shi, Shihui Ying, Zhijie Wen

    Abstract: Multi-modal Magnetic Resonance Imaging (MRI) offers complementary diagnostic information, but some modalities are limited by the long scanning time. To accelerate the whole acquisition process, MRI reconstruction of one modality from highly undersampled k-space data with another fully-sampled reference modality is an efficient solution. However, the misalignment between modalities, which is common… ▽ More

    Submitted 28 December, 2023; originally announced December 2023.

  11. arXiv:2310.11531  [pdf, ps, other

    cs.LG cs.AI eess.SY stat.ML

    Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach

    Authors: Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen

    Abstract: In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (paramet… ▽ More

    Submitted 1 February, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: 22 pages

    MSC Class: 93E35

  12. Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

    Authors: Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv

    Abstract: Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel… ▽ More

    Submitted 16 April, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

  13. arXiv:2305.13774  [pdf, other

    cs.SD eess.AS

    ADD 2023: the Second Audio Deepfake Detection Challenge

    Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

    Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on s… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

  14. arXiv:2305.02774  [pdf, other

    eess.IV cs.CV physics.med-ph

    Spatial and Modal Optimal Transport for Fast Cross-Modal MRI Reconstruction

    Authors: Qi Wang, Zhijie Wen, Jun Shi, Qian Wang, Dinggang Shen, Shihui Ying

    Abstract: Multi-modal magnetic resonance imaging (MRI) plays a crucial role in comprehensive disease diagnosis in clinical medicine. However, acquiring certain modalities, such as T2-weighted images (T2WIs), is time-consuming and prone to be with motion artifacts. It negatively impacts subsequent multi-modal image analysis. To address this issue, we propose an end-to-end deep learning framework that utilize… ▽ More

    Submitted 21 May, 2024; v1 submitted 4 May, 2023; originally announced May 2023.

  15. arXiv:2303.01211  [pdf, other

    cs.SD cs.LG cs.MM eess.AS

    Learning From Yourself: A Self-Distillation Method for Fake Speech Detection

    Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, Zhao Lv

    Abstract: In this paper, we propose a novel self-distillation method for fake speech detection (FSD), which can significantly improve the performance of FSD without increasing the model complexity. For FSD, some fine-grained information is very important, such as spectrogram defects, mute segments, and so on, which are often perceived by shallow networks. However, shallow networks have much noise, which can… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  16. arXiv:2301.03801  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

    Authors: Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Jianhua Tao

    Abstract: Text-to-speech (TTS) and voice conversion (VC) are two different tasks both aiming at generating high quality speaking voice according to different input modality. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content… ▽ More

    Submitted 10 January, 2023; originally announced January 2023.

  17. arXiv:2212.10191  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Emotion Selectable End-to-End Text-based Speech Editing

    Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang

    Abstract: Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In the previous work, CampNet (context-aware mask prediction network) is proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding emotional effect to the editing speech du… ▽ More

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: Under review, 12 pages, 11 figures, demo page is available at https://hairuo55.github.io/Emo-CampNet/

  18. arXiv:2211.05910  [pdf, other

    eess.IV cs.CV

    Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

    Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, **gang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, **woo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li , et al. (71 additional authors not shown)

    Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose… ▽ More

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

  19. arXiv:2210.11429  [pdf

    cs.SD eess.AS

    Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

    Authors: Chunyu Qiang, Jianhua Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang

    Abstract: Current end-to-end code-switching Text-to-Speech (TTS) can already generate high quality two languages speech in the same utterance with single speaker bilingual corpora. When the speakers of the bilingual corpora are different, the naturalness and consistency of the code-switching TTS will be poor. The cross-lingual embedding layers structure we proposed makes similar syllables in different langu… ▽ More

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: accepted in ISCSLP 2021

  20. arXiv:2208.11609  [pdf, other

    eess.IV cs.CV

    Fast Nearest Convolution for Real-Time Efficient Image Super-Resolution

    Authors: Ziwei Luo, Youwei Li, Lei Yu, Qi Wu, Zhihong Wen, Haoqiang Fan, Shuaicheng Liu

    Abstract: Deep learning-based single image super-resolution (SISR) approaches have drawn much attention and achieved remarkable success on modern advanced GPUs. However, most state-of-the-art methods require a huge number of parameters, memories, and computational resources, which usually show inferior inference times when applying them to current mobile device CPUs/NPUs. In this paper, we propose a simple… ▽ More

    Submitted 24 August, 2022; originally announced August 2022.

    Comments: AIM & Mobile AI 2022

  21. arXiv:2208.01214  [pdf, other

    cs.SD cs.LG eess.AS

    Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

    Authors: Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao

    Abstract: Recently, pioneer research works have proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance, and showing that different subbands have different contributions to audio deepfake detection. However, this lacks an explanation of the specific informatio… ▽ More

    Submitted 1 August, 2022; originally announced August 2022.

  22. arXiv:2205.05675  [pdf, other

    cs.CV eess.IV

    NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

    Authors: Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, **gyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, **shan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang , et al. (86 additional authors not shown)

    Abstract: This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improvement of e… ▽ More

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Validation code of the baseline model is available at https://github.com/ofsoundof/IMDN. Validation of all submitted models is available at https://github.com/ofsoundof/NTIRE2022_ESR

  23. arXiv:2203.02678  [pdf, other

    cs.SD cs.CL eess.AS

    NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

    Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen

    Abstract: The traditional vocoders have the advantages of high synthesis efficiency, strong interpretability, and speech editability, while the neural vocoders have the advantage of high synthesis quality. To combine the advantages of two vocoders, inspired by the traditional deterministic plus stochastic model, this paper proposes a novel neural vocoder named NeuralDPS which can retain high speech quality… ▽ More

    Submitted 5 March, 2022; originally announced March 2022.

    Comments: 15 pages, 12 figures; Accepted to TASLP. Demo page https://hairuo55.github.io/NeuralDPS. arXiv admin note: text overlap with arXiv:1906.09573 by other authors

  24. arXiv:2202.09950  [pdf, other

    cs.SD cs.CL eess.AS

    CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

    Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen

    Abstract: The text-based speech editor allows the editing of speech through intuitive cutting, copying, and pasting operations to speed up the process of editing speech. However, the major drawback of current systems is that edited speech often sounds unnatural due to cut-copy-paste operation. In addition, it is not obvious how to synthesize records according to a new word not appearing in the transcript. T… ▽ More

    Submitted 22 March, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

    Comments: under review, 14 pages, 14 figures, demo page is available at https://hairuo55.github.io/CampNet

  25. arXiv:2202.08433  [pdf, ps, other

    cs.SD cs.LG eess.AS

    ADD 2022: the First Audio Deep Synthesis Detection Challenge

    Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

    Abstract: Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake gam… ▽ More

    Submitted 26 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  26. arXiv:2202.07907  [pdf, other

    cs.SD cs.LG eess.AS

    Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis

    Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen

    Abstract: End-to-end singing voice synthesis (SVS) is attractive due to the avoidance of pre-aligned data. However, the auto learned alignment of singing voice with lyrics is difficult to match the duration information in musical score, which will lead to the model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS fr… ▽ More

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: 5 pages, 7 figures

  27. arXiv:2111.13759  [pdf

    cs.LG eess.SY

    Dynamic Analysis of Nonlinear Civil Engineering Structures using Artificial Neural Network with Adaptive Training

    Authors: Xiao Pan, Zhizhao Wen, T. Y. Yang

    Abstract: Dynamic analysis of structures subjected to earthquake excitation is a time-consuming process, particularly in the case of extremely small time step required, or in the presence of high geometric and material nonlinearity. Performing parametric studies in such cases is even more tedious. The advancement of computer graphics hardware in recent years enables efficient training of artificial neural n… ▽ More

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: 12 pages, 10 figures, accepted in the 17th World Conference on Earthquake Engineering, 17WCEE, Sendai, Japan, 2021

    Report number: Report-no:9c-0025, 17 WCEE Proceedings

  28. arXiv:2104.02882  [pdf, other

    eess.AS cs.CL cs.SD

    FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization

    Authors: Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

    Abstract: Transducer-based models, such as RNN-Transducer and transformer-transducer, have achieved great success in speech recognition. A typical transducer model decodes the output sequence conditioned on the current acoustic state and previously predicted tokens step by step. Statistically, The number of blank tokens in the prediction results accounts for nearly 90\% of all tokens. It takes a lot of comp… ▽ More

    Submitted 6 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH2021

  29. TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition

    Authors: Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen, Xuefei Liu

    Abstract: The autoregressive (AR) models, such as attention-based encoder-decoder models and RNN-Transducer, have achieved great success in speech recognition. They predict the output sequence conditioned on the previous tokens and acoustic encoded states, which is inefficient on GPUs. The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict the entire… ▽ More

    Submitted 3 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech2021

  30. arXiv:2102.07594  [pdf, other

    cs.CL cs.AI eess.AS

    Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT

    Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

    Abstract: Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech u… ▽ More

    Submitted 29 August, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: 14 pages, 7 figures

  31. arXiv:2011.08146  [pdf, other

    eess.SP cs.LG

    Temporal Dynamic Model for Resting State fMRI Data: A Neural Ordinary Differential Equation approach

    Authors: Zheyu Wen

    Abstract: The objective of this paper is to provide a temporal dynamic model for resting state functional Magnetic Resonance Imaging (fMRI) trajectory to predict future brain images based on the given sequence. To this end, we came up with the model that takes advantage of representation learning and Neural Ordinary Differential Equation (Neural ODE) to compress the fMRI image data into latent representatio… ▽ More

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: 13 pages, 20 figures, preprint work

  32. arXiv:2011.05591  [pdf, other

    cs.SD cs.LG eess.AS

    Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning

    Authors: Cunhang Fan, Bin Liu, Jianhua Tao, Jiangyan Yi, Zhengqi Wen, Leichao Song

    Abstract: Recurrent neural networks (RNNs) have shown significant improvements in recent years for speech enhancement. However, the model complexity and inference time cost of RNNs are much higher than deep feed-forward neural networks (DNNs). Therefore, these limit the applications of speech enhancement. This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learn… ▽ More

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: Accepted by ISCSLP 2021

  33. arXiv:2011.04249  [pdf, other

    cs.SD cs.CL eess.AS

    Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

    Authors: Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Bin Liu, Zhengqi Wen

    Abstract: The joint training framework for speech enhancement and recognition methods have obtained quite good performances for robust end-to-end automatic speech recognition (ASR). However, these methods only utilize the enhanced feature as the input of the speech recognition component, which are affected by the speech distortion problem. In order to address this problem, this paper proposes a gated recurr… ▽ More

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  34. arXiv:2010.14798  [pdf, other

    cs.SD cs.CL eess.AS

    Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition

    Authors: Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Zhengqi wen

    Abstract: Despite the recent significant advances witnessed in end-to-end (E2E) ASR system for code-switching, hunger for audio-text paired data limits the further improvement of the models' performance. In this paper, we propose a decoupled transformer model to use monolingual paired data and unpaired text data to alleviate the problem of code-switching data shortage. The model is decoupled into two parts:… ▽ More

    Submitted 28 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figures

  35. arXiv:2010.14791  [pdf, other

    eess.AS

    One In A Hundred: Select The Best Predicted Sequence from Numerous Candidates for Streaming Speech Recognition

    Authors: Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

    Abstract: The RNN-Transducers and improved attention-based encoder-decoder models are widely applied to streaming speech recognition. Compared with these two end-to-end models, the CTC model is more efficient in training and inference. However, it cannot capture the linguistic dependencies between the output tokens. Inspired by the success of two-pass end-to-end models, we introduce a transformer decoder an… ▽ More

    Submitted 3 April, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

  36. arXiv:2009.03140  [pdf, ps, other

    eess.SP cs.LG

    Edge Learning with Unmanned Ground Vehicle: Joint Path, Energy and Sample Size Planning

    Authors: Dan Liu, Shuai Wang, Zhigang Wen, Lei Cheng, Miaowen Wen, Yik-Chung Wu

    Abstract: Edge learning (EL), which uses edge computing as a platform to execute machine learning algorithms, is able to fully exploit the massive sensing data generated by Internet of Things (IoT). However, due to the limited transmit power at IoT devices, collecting the sensing data in EL systems is a challenging task. To address this challenge, this paper proposes to integrate unmanned ground vehicle (UG… ▽ More

    Submitted 7 September, 2020; originally announced September 2020.

    Comments: 16 pages, 6 figures, to appear in IEEE Internet of Things Journal

  37. arXiv:2008.03942  [pdf, other

    eess.SP

    Joint Bandwidth Allocation and Path Selection in WANs with Path Cardinality Constraints

    Authors: **xin Wang, Fan Zhang, Zhonglin Xie, Gong Zhang, Zaiwen Wen

    Abstract: In this paper, we study a joint bandwidth allocation and path selection problem via solving a multi-objective minimization problem under the path cardinality constraints, namely MOPC. Our problem formulation captures various types of objectives including the proportional fairness, the total completion time, as well as the worst-case link utilization ratio. Such an optimization problem is very chal… ▽ More

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: Submitted to IEEE TSP and being under review

  38. arXiv:2005.07903  [pdf, other

    eess.AS cs.CL cs.SD

    Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition

    Authors: Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, Zhengqi Wen

    Abstract: Non-autoregressive transformer models have achieved extremely fast inference speed and comparable performance with autoregressive sequence-to-sequence models in neural machine translation. Most of the non-autoregressive transformers decode the target sequence from a predefined-length mask sequence. If the predefined length is too long, it will cause a lot of redundant calculations. If the predefin… ▽ More

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: 5 pages

  39. arXiv:2005.04862  [pdf, other

    eess.AS cs.CL

    Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition

    Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

    Abstract: Although attention based end-to-end models have achieved promising performance in speech recognition, the multi-pass forward computation in beam-search increases inference time cost, which limits their practical applications. To address this issue, we propose a non-autoregressive end-to-end speech recognition system called LASO (listen attentively, and spell once). Because of the non-autoregressiv… ▽ More

    Submitted 5 August, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

    Comments: accepted by INTERSPEECH2020

  40. arXiv:2004.02420  [pdf, other

    eess.AS cs.LG cs.SD

    Simultaneous Denoising and Dereverberation Using Deep Embedding Features

    Authors: Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen

    Abstract: Monaural speech dereverberation is a very challenging task because no spatial cues can be used. When the additive noises exist, this task becomes more challenging. In this paper, we propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on the deep clustering (DC). DC is a state-of-the-art method for speech separation tha… ▽ More

    Submitted 6 April, 2020; originally announced April 2020.

  41. arXiv:2003.07544  [pdf, other

    eess.AS cs.MM cs.SD

    Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method

    Authors: Cunhang Fan, Jianhua Tao, Bin Liu, Jiangyan Yi, Zhengqi Wen, Xuefei Liu

    Abstract: In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual int… ▽ More

    Submitted 17 March, 2020; originally announced March 2020.

    Comments: ACCEPTED by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

  42. arXiv:2002.01626  [pdf, other

    eess.AS cs.LG cs.SD

    Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features

    Authors: Cunhang Fan, Bin Liu, Jianhua Tao, Jiangyan Yi, Zhengqi Wen

    Abstract: Multi-channel deep clustering (MDC) has acquired a good performance for speech separation. However, MDC only applies the spatial features as the additional information. So it is difficult to learn mutual relationship between spatial and spectral features. Besides, the training objective of MDC is defined at embedding vectors, rather than real separated sources, which may damage the separation perf… ▽ More

    Submitted 4 February, 2020; originally announced February 2020.

  43. arXiv:1912.04843  [pdf

    eess.SP

    A novel generative reverse net assisted evolution algorithm for expensive-computational optimizations

    Authors: Yu Li, Hu Wang, Ziming Wen, Xin Wang

    Abstract: Simulation-based optimization is a useful method for practical design problems. However, it is difficult for complicated problems due to expensive-computational costs. A popular way to overcome this issue is to use a surrogate model to save the cost. Nevertheless, limited design parameters those are input to traditional surrogate models can difficultly represent the whole design problem, which mig… ▽ More

    Submitted 8 December, 2019; originally announced December 2019.

  44. arXiv:1912.02958  [pdf, other

    eess.AS cs.CL cs.LG

    Synchronous Transformers for End-to-End Speech Recognition

    Authors: Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen

    Abstract: For most of the attention-based sequence-to-sequence models, the decoder predicts the output sequence conditioned on the entire input sequence processed by the encoder. The asynchronous problem between the encoding and decoding makes these models difficult to be applied for online speech recognition. In this paper, we propose a model named synchronous transformer to address this problem, which can… ▽ More

    Submitted 23 February, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Accepted by ICASSP 2020

  45. arXiv:1912.01777  [pdf, other

    eess.AS cs.CL cs.SD

    Integrating Knowledge into End-to-End Speech Recognition from External Text-Only Data

    Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Zhengkun Tian, Shuai Zhang

    Abstract: Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because of the end-to-end training, an AED model is usually trained with speech-text paired data. It is challenging to incorporate external text-only data into AED models. Another issue of the AED model is that it does not use the right context of a text token while predicting the token… ▽ More

    Submitted 15 March, 2021; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Submitted TASLP

  46. arXiv:1910.11269  [pdf, other

    cs.SD eess.AS

    Towards Fine-Grained Prosody Control for Voice Conversion

    Authors: Zheng Lian, Zhengqi Wen

    Abstract: In a typical voice conversion system, prior works utilize various acoustic features (e.g., the pitch, voiced/unvoiced flag, aperiodicity) of the source speech to control the prosody of generated waveform. However, the prosody is related with many factors, such as the intonation, stress and rhythm. It is a challenging task to perfectly describe the prosody through acoustic features. To deal with th… ▽ More

    Submitted 27 May, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

  47. Self-Attention Transducers for End-to-End Speech Recognition

    Authors: Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengqi Wen

    Abstract: Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes it difficult for parallelization . In this paper, we propose a self-attention transducer (SA-T) for speech recognition. RNNs are replaced with self-attention blocks, which are powerful to model long-term dependencies inside sequences and able to be ef… ▽ More

    Submitted 28 September, 2019; originally announced September 2019.

    Journal ref: Proc. Interspeech 2019, 4395-4399

  48. arXiv:1907.09884  [pdf, other

    cs.SD cs.LG eess.AS

    Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features

    Authors: Cunhang Fan, Bin Liu, Jianhua Tao, Jiangyan Yi, Zhengqi Wen

    Abstract: Deep clustering (DC) and utterance-level permutation invariant training (uPIT) have been demonstrated promising for speaker-independent speech separation. DC is usually formulated as two-step processes: embedding learning and embedding clustering, which results in complex separation pipelines and a huge obstacle in directly optimizing the actual separation objectives. As for uPIT, it only minimize… ▽ More

    Submitted 23 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure, accepted by INTERSPEECH 2019

  49. arXiv:1907.09006  [pdf, other

    eess.AS cs.CL cs.SD

    Forward-Backward Decoding for Regularizing End-to-End TTS

    Authors: Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, Jianhua Tao

    Abstract: Neural end-to-end TTS can generate very high-quality synthesized speech, and even close to human recording within similar domain text. However, it performs unsatisfactory when scaling it to challenging test sets. One concern is that the encoder-decoder with attention-based network adopts autoregressive generative sequence model with the limitation of "exposure bias" To address this issue, we propo… ▽ More

    Submitted 18 July, 2019; originally announced July 2019.

    Comments: Accepted by INTERSPEECH2019. arXiv admin note: text overlap with arXiv:1808.04064, arXiv:1804.05374 by other authors

  50. arXiv:1907.06017  [pdf, other

    eess.AS cs.CL

    Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

    Authors: Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen

    Abstract: Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works utilize linear interpolation or a fusion network to integrate external language models. However, these approaches introduce external components, and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integratin… ▽ More

    Submitted 13 July, 2019; originally announced July 2019.

    Comments: 5 pages, 3 figures, accepted by INTERSPEECH 2019