Skip to main content

Showing 1–50 of 85 results for author: Tao, J

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.10591  [pdf, other

    eess.AS cs.AI cs.CV cs.MM cs.SD

    MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

    Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

    Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on… ▽ More

    Submitted 15 June, 2024; originally announced June 2024.

  2. arXiv:2406.08112  [pdf, other

    cs.SD cs.AI eess.AS

    Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

    Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

    Abstract: With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880

  3. arXiv:2406.06086  [pdf, other

    cs.SD eess.AS

    RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

    Authors: Yujie Chen, Jiangyan Yi, Jun Xue, Chenglong Wang, Xiaohui Zhang, Shunbo Dong, Siding Zeng, Jianhua Tao, Lv Zhao, Cunhang Fan

    Abstract: Fake artefacts for discriminating between bonafide and fake audio can exist in both short- and long-range segments. Therefore, combining local and global feature information can effectively discriminate between bonafide and fake audio. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepf… ▽ More

    Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2406.04840  [pdf, other

    cs.SD eess.AS

    TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

    Authors: Junzuo Zhou, Jiangyan Yi, Tao Wang, Jianhua Tao, Ye Bai, Chu Yuan Zhang, Yong Ren, Zhengqi Wen

    Abstract: Various threats posed by the progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task involve adding watermarks to the audio separately after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: acceped by interspeech 2024

  5. arXiv:2406.04683  [pdf, other

    cs.SD eess.AS

    PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

    Authors: Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

    Abstract: Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge abo… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: accepted by INTERSPEECH2024

  6. arXiv:2406.03247  [pdf, other

    cs.SD eess.AS

    Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

    Authors: Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

    Abstract: The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework guided, aiming for highly generalized FAD, called GFL-FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation… ▽ More

    Submitted 9 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  7. arXiv:2406.03240  [pdf, other

    cs.SD cs.AI eess.AS

    Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy

    Authors: Yuankun Xie, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Xiaopeng Wang, Haonnan Cheng, Long Ye, Jianhua Tao

    Abstract: With the proliferation of deepfake audio, there is an urgent need to investigate their attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose Real Emphasis an… ▽ More

    Submitted 8 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  8. arXiv:2406.03237  [pdf, other

    cs.SD eess.AS

    Generalized Fake Audio Detection via Deep Stable Learning

    Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi

    Abstract: Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate t… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: accepted by INTERSPEECH2024

  9. arXiv:2405.08596   

    cs.SD eess.AS

    EVDA: Evolving Deepfake Audio Detection Continual Learning Benchmark

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao

    Abstract: The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts… ▽ More

    Submitted 15 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: This paper need more modification

  10. arXiv:2405.04880  [pdf, other

    cs.SD cs.AI eess.AS

    The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

    Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

    Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio currently exhibits widespread, high deception, and type versatility, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on… ▽ More

    Submitted 15 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  11. arXiv:2401.05698  [pdf, other

    cs.CV cs.HC cs.MM cs.SD eess.AS

    HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

    Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

    Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-ware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in… ▽ More

    Submitted 1 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted by Information Fusion. The code is available at https://github.com/sunlicai/HiCMAE

    Journal ref: Information Fusion, 2024

  12. arXiv:2312.09651  [pdf, other

    cs.SD cs.CR cs.LG eess.AS

    What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

    Authors: Xiaohui Zhang, Jiangyan Yi, Chenglong Wang, Chuyuan Zhang, Siding Zeng, Jianhua Tao

    Abstract: The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the… ▽ More

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by the main track The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

  13. arXiv:2310.18222  [pdf

    eess.IV cs.CV cs.LG

    TBDLNet: a network for classifying multidrug-resistant and drug-sensitive tuberculosis

    Authors: Ziquan Zhu, **g Tao, Shuihua Wang, Xin Zhang, Yudong Zhang

    Abstract: This paper proposes applying a novel deep-learning model, TBDLNet, to recognize CT images to classify multidrug-resistant and drug-sensitive tuberculosis automatically. The pre-trained ResNet50 is selected to extract features. Three randomized neural networks are used to alleviate the overfitting problem. The ensemble of three RNNs is applied to boost the robustness via majority voting. The propos… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Journal ref: Engineering Reports (2023)

  14. Dual-Branch Knowledge Distillation for Noise-Robust Synthetic Speech Detection

    Authors: Cunhang Fan, Mingming Ding, Jianhua Tao, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Zhao Lv

    Abstract: Most research in synthetic speech detection (SSD) focuses on improving performance on standard noise-free datasets. However, in actual situations, noise interference is usually present, causing significant performance degradation in SSD systems. To improve noise robustness, this paper proposes a dual-branch knowledge distillation synthetic speech detection (DKDSSD) method. Specifically, a parallel… ▽ More

    Submitted 16 April, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

  15. arXiv:2310.00014  [pdf, other

    cs.SD eess.AS

    Fewer-token Neural Speech Codec with Time-invariant Codes

    Authors: Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhua Tao, Chuyuan Zhang, Junzuo Zhou

    Abstract: Language model based text-to-speech (TTS) models, like VALL-E, have gained attention for their outstanding in-context learning capability in zero-shot scenarios. Neural speech codec is a critical component of these models, which can convert speech into discrete token representations. However, excessive token sequences from the codec may negatively affect prediction accuracy and restrict the progre… ▽ More

    Submitted 10 March, 2024; v1 submitted 15 September, 2023; originally announced October 2023.

    Comments: Accepted by ICASSP 2024

  16. arXiv:2309.08166  [pdf, other

    cs.SD eess.AS

    Controllable Residual Speaker Representation for Voice Conversion

    Authors: Le Xu, Jiangyan Yi, Jianhua Tao, Tao Wang, Yong Ren, Rongxiu Zhong

    Abstract: Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP 2024

  17. arXiv:2309.07147  [pdf, other

    eess.SP cs.HC cs.LG cs.MM cs.SD eess.AS

    DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection

    Authors: Cunhang Fan, Hongyu Zhang, Wei Huang, Jun Xue, Jianhua Tao, Jiangyan Yi, Zhao Lv, Xiaopei Wu

    Abstract: Auditory Attention Detection (AAD) aims to detect target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural network designed for processing Euclidean data like images. This makes it challenging to handle EEG signals, which possess non-Euclidean… ▽ More

    Submitted 7 September, 2023; originally announced September 2023.

  18. arXiv:2309.06780  [pdf, other

    cs.SD eess.AS

    Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms

    Authors: Chu Yuan Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xinrui Yan

    Abstract: Recent strides in neural speech synthesis technologies, while enjoying widespread applications, have nonetheless introduced a series of challenges, spurring interest in the defence against the threat of misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area has certain limitations in scope. To add… ▽ More

    Submitted 15 June, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted by CCL 2024

  19. arXiv:2308.14970  [pdf, other

    cs.SD eess.AS

    Audio Deepfake Detection: A Survey

    Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao

    Abstract: Audio deepfake detection is an emerging active topic. A growing number of literatures have aimed to study deepfake detection algorithms and achieved effective performance, the problem of which is far from being solved. Although there are some review literatures, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluati… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

  20. arXiv:2308.09944  [pdf, other

    cs.SD eess.AS

    Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

    Authors: Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv

    Abstract: The rhythm of synthetic speech is usually too smooth, which causes that the fundamental frequency (F0) of synthetic speech is significantly different from that of real speech. It is expected that the F0 feature contains the discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband so a… ▽ More

    Submitted 19 August, 2023; originally announced August 2023.

  21. arXiv:2308.03300  [pdf, other

    cs.SD cs.LG eess.AS

    Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Chuyuan Zhang

    Abstract: Current fake audio detection algorithms have achieved promising performances on most datasets. However, their performance may be significantly degraded when dealing with audio of a different dataset. The orthogonal weight modification to overcome catastrophic forgetting does not consider the similarity of genuine audio across different datasets. To overcome this limitation, we propose a continual… ▽ More

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: 40th Internation Conference on Machine Learning (ICML 2023)

  22. arXiv:2307.08323  [pdf, other

    cs.SD eess.AS

    TST: Time-Sparse Transducer for Automatic Speech Recognition

    Authors: Xiaohui Zhang, Mangui Liang, Zhengkun Tian, Jiangyan Yi, Jianhua Tao

    Abstract: End-to-end model, especially Recurrent Neural Network Transducer (RNN-T), has achieved great success in speech recognition. However, transducer requires a great memory footprint and computing time when processing a long decoding sequence. To solve this problem, we propose a model named time-sparse transducer, which introduces a time-sparse mechanism into transducer. In this mechanism, we obtain th… ▽ More

    Submitted 17 July, 2023; originally announced July 2023.

    Comments: 10 pages

    Journal ref: International Conference on Artificial Intelligence (CICAI 2023)

  23. arXiv:2307.05386  [pdf, other

    eess.SP physics.optics

    Exploring the Potential of Integrated Optical Sensing and Communication (IOSAC) Systems with Si Waveguides for Future Networks

    Authors: Xiangpeng Ou, Ying Qiu, Ming Luo, Fujun Sun, Peng Zhang, Gang Yang, Junjie Li, Jianfeng Gao, Xiaobin He, Anyan Du, Bo Tang, Bin Li, Zichen Liu, Zhihua Li, Ling Xie, Xi Xiao, Jun Luo, Wenwu Wang, ** Tao, Yan Yang

    Abstract: Advanced silicon photonic technologies enable integrated optical sensing and communication (IOSAC) in real time for the emerging application requirements of simultaneous sensing and communication for next-generation networks. Here, we propose and demonstrate the IOSAC system on the silicon nitride (SiN) photonics platform. The IOSAC devices based on microring resonators are capable of monitoring t… ▽ More

    Submitted 27 June, 2023; originally announced July 2023.

    Comments: 11pages, 5 figutres

  24. arXiv:2307.04216  [pdf, other

    cs.LG cs.AI eess.IV

    Hierarchical Autoencoder-based Lossy Compression for Large-scale High-resolution Scientific Data

    Authors: Hieu Le, Jian Tao

    Abstract: Lossy compression has become an important technique to reduce data size in many domains. This type of compression is especially valuable for large-scale scientific data, whose size ranges up to several petabytes. Although Autoencoder-based models have been successfully leveraged to compress images and videos, such neural networks have not widely gained attention in the scientific data domain. Our… ▽ More

    Submitted 6 May, 2024; v1 submitted 9 July, 2023; originally announced July 2023.

    Comments: 14 pages

  25. arXiv:2306.05708  [pdf, other

    cs.SD cs.LG eess.AS

    Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

    Authors: Haogeng Liu, Tao Wang, Jie Cao, Ran He, Jianhua Tao

    Abstract: Denoising Diffusion Probabilistic Models have shown extraordinary ability on various generative tasks. However, their slow inference speed renders them impractical in speech synthesis. This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. Firstly, we employ linear interpolation between the t… ▽ More

    Submitted 12 June, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

  26. arXiv:2306.05617  [pdf, other

    cs.SD cs.CL eess.AS

    Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

    Authors: Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, Jianhua Tao, Le Xu, Ruibo Fu

    Abstract: Self-supervised speech models are a rapidly develo** research topic in fake audio detection. Many pre-trained models can serve as feature extractors, learning richer and higher-level speech features. However,when fine-tuning pre-trained models, there is often a challenge of excessively long training times and high memory consumption, and complete fine-tuning is also very expensive. To alleviate… ▽ More

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: 6pages

    Journal ref: IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis

  27. arXiv:2306.04956  [pdf, other

    cs.SD cs.LG eess.AS

    Adaptive Fake Audio Detection with Low-Rank Model Squeezing

    Authors: Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenlong Wang, Le Xu, Ruibo Fu

    Abstract: The rapid advancement of spoofing algorithms necessitates the development of robust detection methods capable of accurately identifying emerging fake audio. Traditional approaches, such as finetuning on new datasets containing these novel spoofing algorithms, are computationally intensive and pose a risk of impairing the acquired knowledge of known fake audio types. To address these challenges, th… ▽ More

    Submitted 8 June, 2023; originally announced June 2023.

    Journal ref: DADA workshop on IJCAI 2023

  28. arXiv:2305.13774  [pdf, other

    cs.SD eess.AS

    ADD 2023: the Second Audio Deepfake Detection Challenge

    Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

    Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on s… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

  29. arXiv:2305.13701  [pdf, other

    cs.SD eess.AS

    TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

    Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen

    Abstract: Current fake audio detection relies on hand-crafted features, which lose information during extraction. To overcome this, recent studies use direct feature extraction from raw audio signals. For example, RawNet is one of the representative works in end-to-end fake audio detection. However, existing work on RawNet does not optimize the parameters of the Sinc-conv during training, which limited its… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech2023

  30. arXiv:2305.13700  [pdf, other

    cs.SD eess.AS

    Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

    Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Xun Chen

    Abstract: Existing fake audio detection systems perform well in in-domain testing, but still face many challenges in out-of-domain testing. This is due to the mismatch between the training and test data, as well as the poor generalizability of features extracted from limited views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalized features from p… ▽ More

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech2023

  31. arXiv:2305.11202  [pdf

    cs.HC cs.SE eess.SY

    LLM-based Frameworks for Power Engineering from Routine to Novel Tasks

    Authors: Ran Li, Chuanqing Pu, Junyi Tao, Canbing Li, Feilong Fan, Yue Xiang, Sijie Chen

    Abstract: The digitalization of energy sectors has expanded the coding responsibilities for power engineers and researchers. This research article explores the potential of leveraging Large Language Models (LLMs) to alleviate this burden. Here, we propose LLM-based frameworks for different programming tasks in power systems. For well-defined and routine tasks like the classic unit commitment (UC) problem, w… ▽ More

    Submitted 19 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  32. arXiv:2305.02269  [pdf, other

    cs.SD cs.CL eess.AS

    M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

    Authors: **long Xue, Yayue Deng, Feng** Wang, Ya Li, Yingming Gao, Jianhua Tao, Jianqing Sun, Jiaen Liang

    Abstract: Conversational text-to-speech (TTS) aims to synthesize speech with proper prosody of reply based on the historical conversation. However, it is still a challenge to comprehensively model the conversation, and a majority of conversational TTS systems only focus on extracting global information and omit local prosody features, which contain important fine-grained information like keywords and emphas… ▽ More

    Submitted 3 May, 2023; originally announced May 2023.

    Comments: 5 pages, 1 figures, 2 tables. Accepted by ICASSP 2023

  33. arXiv:2301.03801  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

    Authors: Haogeng Liu, Tao Wang, Ruibo Fu, Jiangyan Yi, Zhengqi Wen, Jianhua Tao

    Abstract: Text-to-speech (TTS) and voice conversion (VC) are two different tasks both aiming at generating high quality speaking voice according to different input modality. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content… ▽ More

    Submitted 10 January, 2023; originally announced January 2023.

  34. arXiv:2212.10191  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Emotion Selectable End-to-End Text-based Speech Editing

    Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen, Chu Yuan Zhang

    Abstract: Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In the previous work, CampNet (context-aware mask prediction network) is proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding emotional effect to the editing speech du… ▽ More

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: Under review, 12 pages, 11 figures, demo page is available at https://hairuo55.github.io/Emo-CampNet/

  35. arXiv:2211.12282  [pdf, other

    eess.SP

    Vector Approximate Message Passing based Channel Estimation for MIMO-OFDM Underwater Acoustic Communications

    Authors: Wenxuan Chen, Jun Tao, Lu Ma, Gang Qiao

    Abstract: Accurate channel estimation is critical to the performance of orthogonal frequency-division multiplexing (OFDM) underwater acoustic (UWA) communications, especially under multiple-input multiple-output (MIMO) scenarios. In this paper, we explore Vector Approximate Message Passing (VAMP) coupled with Expected Maximum (EM) to obtain channel estimation (CE) for MIMO OFDM UWA communications. The EM-VA… ▽ More

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Journal:IEEE Journal of Oceanic Engineering(Date of Submission:2022-06-25)

  36. arXiv:2211.06073  [pdf, other

    cs.SD cs.CL eess.AS

    SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection

    Authors: Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu

    Abstract: Many datasets have been designed to further the development of fake audio detection. However, fake utterances in previous datasets are mostly generated by altering timbre, prosody, linguistic content or channel noise of original audio. These datasets leave out a scenario, in which the acoustic scene of an original audio is manipulated with a forged one. It will pose a major threat to our society i… ▽ More

    Submitted 4 April, 2024; v1 submitted 11 November, 2022; originally announced November 2022.

    Comments: Accepted by Pattern Recognition, 1 April 2024

  37. arXiv:2211.05363  [pdf, other

    cs.SD eess.AS

    EmoFake: An Initial Dataset for Emotion Fake Audio Detection

    Authors: Yan Zhao, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xiaohui Zhang, Yongfeng Dong

    Abstract: Many datasets have been designed to further the development of fake audio detection, such as datasets of the ASVspoof and ADD challenges. However, these datasets do not consider a situation that the emotion of the audio has been changed from one to another, while other information (e.g. speaker identity and content) remains the same. Changing the emotion of an audio can lead to semantic changes. S… ▽ More

    Submitted 14 September, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

  38. arXiv:2210.12311  [pdf, other

    eess.SP

    Proportionate Recursive Maximum Correntropy Criterion Adaptive Filtering Algorithms and their Performance Analysis

    Authors: Zhen Qin, Jun Tao, Le Yang, Ming Jiang

    Abstract: The maximum correntropy criterion (MCC) has been employed to design outlier-robust adaptive filtering algorithms, among which the recursive MCC (RMCC) algorithm is a typical one. Motivated by the success of our recently proposed proportionate recursive least squares (PRLS) algorithm for sparse system identification, we propose to introduce the proportionate updating (PU) mechanism into the RMCC, l… ▽ More

    Submitted 7 October, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

  39. arXiv:2210.11429  [pdf

    cs.SD eess.AS

    Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

    Authors: Chunyu Qiang, Jianhua Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang

    Abstract: Current end-to-end code-switching Text-to-Speech (TTS) can already generate high quality two languages speech in the same utterance with single speaker bilingual corpora. When the speakers of the bilingual corpora are different, the naturalness and consistency of the code-switching TTS will be poor. The cross-lingual embedding layers structure we proposed makes similar syllables in different langu… ▽ More

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: accepted in ISCSLP 2021

  40. arXiv:2208.10489  [pdf, other

    cs.SD cs.AI eess.AS

    System Fingerprint Recognition for Deepfake Audio: An Initial Dataset and Investigation

    Authors: Xinrui Yan, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Junzuo Zhou, Hao Gu, Ruibo Fu

    Abstract: The rapid progress of deep speech synthesis models has posed significant threats to society such as malicious content manipulation. Therefore, many studies have emerged to detect the so-called deepfake audio. However, existing works focus on the binary detection of real audio and fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, it is needed to… ▽ More

    Submitted 15 September, 2023; v1 submitted 21 August, 2022; originally announced August 2022.

    Comments: 13 pages, 4 figures. Submit to IEEE Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2208.09646

  41. arXiv:2208.09646  [pdf, other

    cs.SD cs.AI eess.AS

    An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

    Authors: Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu

    Abstract: Many effective attempts have been made for fake audio detection. However, they can only provide detection results but no countermeasures to curb this harm. For many related practical applications, what model or algorithm generated the fake audio also is needed. Therefore, We propose a new problem for detecting vocoder fingerprints of fake audio. Experiments are conducted on the datasets synthesize… ▽ More

    Submitted 20 August, 2022; originally announced August 2022.

    Comments: Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for Audio Multimedia

  42. arXiv:2208.09618  [pdf, other

    cs.SD cs.AI eess.AS

    Fully Automated End-to-End Fake Audio Detection

    Authors: Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu

    Abstract: The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, artificial adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore this paper proposes a fully automated end-toen… ▽ More

    Submitted 20 August, 2022; originally announced August 2022.

  43. arXiv:2208.01214  [pdf, other

    cs.SD cs.LG eess.AS

    Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

    Authors: Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao

    Abstract: Recently, pioneer research works have proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance, and showing that different subbands have different contributions to audio deepfake detection. However, this lacks an explanation of the specific informatio… ▽ More

    Submitted 1 August, 2022; originally announced August 2022.

  44. arXiv:2207.12308  [pdf, other

    cs.SD eess.AS

    CFAD: A Chinese Dataset for Fake Audio Detection

    Authors: Haoxin Ma, Jiangyan Yi, Chenglong Wang, Xinrui Yan, Jianhua Tao, Tao Wang, Shiming Wang, Ruibo Fu

    Abstract: Fake audio detection is a growing concern and some relevant datasets have been designed for research. However, there is no standard public Chinese dataset under complex conditions.In this paper, we aim to fill in the gap and design a Chinese fake audio detection dataset (CFAD) for studying more generalized detection methods. Twelve mainstream speech-generation techniques are used to generate fake… ▽ More

    Submitted 18 July, 2023; v1 submitted 12 July, 2022; originally announced July 2022.

    Comments: FAD renamed as CFAD

  45. arXiv:2206.07280  [pdf

    eess.IV cs.CV cs.NE

    ERNAS: An Evolutionary Neural Architecture Search for Magnetic Resonance Image Reconstructions

    Authors: Samira Vafay Eslahi, Jian Tao, Jim Ji

    Abstract: Magnetic resonance imaging (MRI) is one of the noninvasive imaging modalities that can produce high-quality images. However, the scan procedure is relatively slow, which causes patient discomfort and motion artifacts in images. Accelerating MRI hardware is constrained by physical and physiological limitations. A popular alternative approach to accelerated MRI is to undersample the k-space data. Wh… ▽ More

    Submitted 16 January, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: 11 pages, 9 figures, and 4 tables

  46. arXiv:2203.13617  [pdf, other

    eess.AS cs.LG cs.SD

    EmotionNAS: Two-stream Neural Architecture Search for Speech Emotion Recognition

    Authors: Haiyang Sun, Zheng Lian, Bin Liu, Ying Li, Licai Sun, Cong Cai, Jianhua Tao, Meng Wang, Yuan Cheng

    Abstract: Speech emotion recognition (SER) is an important research topic in human-computer interaction. Existing works mainly rely on human expertise to design models. Despite their success, different datasets often require distinct structures and hyperparameters. Searching for an optimal model for each dataset is time-consuming and labor-intensive. To address this problem, we propose a two-stream neural a… ▽ More

    Submitted 9 June, 2023; v1 submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2023

  47. arXiv:2203.02678  [pdf, other

    cs.SD cs.CL eess.AS

    NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

    Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen

    Abstract: The traditional vocoders have the advantages of high synthesis efficiency, strong interpretability, and speech editability, while the neural vocoders have the advantage of high synthesis quality. To combine the advantages of two vocoders, inspired by the traditional deterministic plus stochastic model, this paper proposes a novel neural vocoder named NeuralDPS which can retain high speech quality… ▽ More

    Submitted 5 March, 2022; originally announced March 2022.

    Comments: 15 pages, 12 figures; Accepted to TASLP. Demo page https://hairuo55.github.io/NeuralDPS. arXiv admin note: text overlap with arXiv:1906.09573 by other authors

  48. arXiv:2202.09950  [pdf, other

    cs.SD cs.CL eess.AS

    CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

    Authors: Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen

    Abstract: The text-based speech editor allows the editing of speech through intuitive cutting, copying, and pasting operations to speed up the process of editing speech. However, the major drawback of current systems is that edited speech often sounds unnatural due to cut-copy-paste operation. In addition, it is not obvious how to synthesize records according to a new word not appearing in the transcript. T… ▽ More

    Submitted 22 March, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

    Comments: under review, 14 pages, 14 figures, demo page is available at https://hairuo55.github.io/CampNet

  49. arXiv:2202.08433  [pdf, ps, other

    cs.SD cs.LG eess.AS

    ADD 2022: the First Audio Deep Synthesis Detection Challenge

    Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

    Abstract: Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake gam… ▽ More

    Submitted 26 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  50. arXiv:2202.07907  [pdf, other

    cs.SD cs.LG eess.AS

    Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis

    Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen

    Abstract: End-to-end singing voice synthesis (SVS) is attractive due to the avoidance of pre-aligned data. However, the auto learned alignment of singing voice with lyrics is difficult to match the duration information in musical score, which will lead to the model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS fr… ▽ More

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: 5 pages, 7 figures