Showing 1–32 of 32 results for author: Du, C

Searching in archive eess.
  1. arXiv:2405.18726  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI

    Authors: Che Liu, Changde Du, Xiaoyu Chen, Huiguang He

    Abstract: Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Leveraging non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach mimics the inverse pathway of auditory processing. Initially, we utili…

    Submitted 28 May, 2024; originally announced May 2024.

  2. arXiv:2404.19723  [pdf, other]

    eess.AS cs.SD

    Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

    Authors: Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Recent popular decoder-only text-to-speech models are known for their ability of generating natural-sounding speech. However, such models sometimes suffer from word skipping and repeating due to the lack of explicit monotonic alignment constraints. In this paper, we notice from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech…

    Submitted 30 April, 2024; originally announced April 2024.

  3. arXiv:2404.17890  [pdf, other]

    eess.IV cs.AI cs.CV

    DPER: Diffusion Prior Driven Neural Representation for Limited Angle and Sparse View CT Reconstruction

    Authors: Chenhe Du, Xiyue Lin, Qing Wu, Xuanyu Tian, Ying Su, Zhe Luo, Hongjiang Wei, S. Kevin Zhou, Jingyi Yu, Yuyao Zhang

    Abstract: Limited-angle and sparse-view computed tomography (LACT and SVCT) are crucial for expanding the scope of X-ray CT applications. However, they face challenges due to incomplete data acquisition, resulting in diverse artifacts in the reconstructed CT images. Emerging implicit neural representation (INR) techniques, such as NeRF, NeAT, and NeRP, have shown promise in under-determined CT imaging recon…

    Submitted 27 April, 2024; originally announced April 2024.

    Comments: 15 pages, 10 figures

    ACM Class: I.2.10; I.4.5

  4. arXiv:2404.17484  [pdf, other]

    cs.CV eess.IV

    Sparse Reconstruction of Optical Doppler Tomography Based on State Space Model

    Authors: Zhenghong Li, Jiaxiang Ren, Wensheng Cheng, Congwu Du, Yingtian Pan, Haibin Ling

    Abstract: Optical Doppler Tomography (ODT) is a blood flow imaging technique popularly used in bioengineering applications. The fundamental unit of ODT is the 1D frequency response along the A-line (depth), named raw A-scan. A 2D ODT image (B-scan) is obtained by first sensing raw A-scans along the B-line (width), and then constructing the B-scan from these raw A-scans via magnitude-phase analysis and post-…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 19 pages, 5 figures

  5. arXiv:2404.16327  [pdf, other]

    cs.IT eess.SP

    Generalized Step-Chirp Sequences With Flexible Bandwidth

    Authors: Cheng Du, Yi Jiang

    Abstract: Sequences with low aperiodic autocorrelation sidelobes have been extensively researched in literatures. With sufficiently low integrated sidelobe level (ISL), their power spectrums are asymptotically flat over the whole frequency domain. However, for the beam sweeping in the massive multi-input multi-output (MIMO) broadcast channels, the flat spectrum should be constrained in a passband with tunab…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted by 2024 IEEE International Symposium on Information Theory

  6. arXiv:2404.06079  [pdf, other]

    eess.AS cs.AI

    The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

    Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

    Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challen…

    Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: 5 pages, 3 figures. Report of a challenge

  7. arXiv:2401.14321  [pdf, other]

    eess.AS cs.SD

    VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

    Authors: Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu

    Abstract: Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping and repeating. To address this limitation,…

    Submitted 29 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

  8. arXiv:2310.14580  [pdf, other]

    cs.SD eess.AS

    Acoustic BPE for Speech Generation with Discrete Tokens

    Authors: Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling proces…

    Submitted 15 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures; accepted to ICASSP 2024

  9. arXiv:2309.07377  [pdf, other]

    eess.AS cs.SD

    Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

    Authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

    Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speec…

    Submitted 14 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted in ICASSP 2024

  10. arXiv:2309.05027  [pdf, other]

    eess.AS cs.AI cs.HC cs.SD

    VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

    Authors: Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

    Abstract: Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the…

    Submitted 16 January, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

    Comments: 4 figure, 5 pages, accepted to ICASSP 2024

  11. arXiv:2308.04244  [pdf, other]

    cs.SD cs.HC eess.AS q-bio.NC q-bio.QM

    Auditory Attention Decoding with Task-Related Multi-View Contrastive Learning

    Authors: Xiaoyu Chen, Changde Du, Qiongyi Zhou, Huiguang He

    Abstract: The human brain can easily focus on one speaker and suppress others in scenarios such as a cocktail party. Recently, researchers found that auditory attention can be decoded from the electroencephalogram (EEG) data. However, most existing deep learning methods are difficult to use prior knowledge of different views (that is attended speech and EEG are task-related views) and extract an unsatisfact…

    Submitted 8 August, 2023; originally announced August 2023.

  12. arXiv:2306.14145  [pdf, other]

    cs.SD cs.CL eess.AS

    DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

    Authors: Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres(i.e. speaker similarity) and eliminate the accents from their first language(i.e. nativeness). In this paper, we demonstrated that vector-quantized(VQ) acoustic feature contains less speaker i…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  13. arXiv:2306.08588  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

    Authors: Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen

    Abstract: Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition. However, there remain several challenging scenarios that E2E models are not competent in, such as code-switching and named entity recognition (NER). Data augmentation is a common and effective practice for these two scenarios. However, the cu…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  14. UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu

    Abstract: The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted…

    Submitted 28 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted to AAAI 2024

  15. Robust Power Allocation for Integrated Visible Light Positioning and Communication Networks

    Authors: Shuai Ma, Ruixin Yang, Chun Du, Hang Li, Youlong Wu, Naofal Al-Dhahir, Shiyin Li

    Abstract: Integrated visible light positioning and communication (VLPC), capable of combining advantages of visible light communications (VLC) and visible light positioning (VLP), is a promising key technology for the future Internet of Things. In VLPC networks, positioning and communications are inherently coupled, which has not been sufficiently explored in the literature. We propose a robust power alloca…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: 13 pages, 15 figures, accepted by IEEE Transactions on Communications

  16. arXiv:2305.03177  [pdf]

    eess.SP cs.CV cs.LG eess.IV physics.optics

    Deep Learning-Assisted Simultaneous Targets Sensing and Super-Resolution Imaging

    Authors: Jin Zhao, Huang Zhao Zhang, Ming-Zhe Chong, Yue-Yi Zhang, Zi-Wen Zhang, Zong-Kun Zhang, Chao-Hai Du, Pu-Kun Liu

    Abstract: Recently, metasurfaces have experienced revolutionary growth in the sensing and superresolution imaging field, due to their enabling of subwavelength manipulation of electromagnetic waves. However, the addition of metasurfaces multiplies the complexity of retrieving target information from the detected fields. Besides, although the deep learning method affords a compelling platform for a series of…

    Submitted 2 May, 2023; originally announced May 2023.

  17. arXiv:2304.13121  [pdf, other]

    cs.SD eess.AS

    Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu

    Abstract: In this paper, we describe the systems developed by the SJTU X-LANCE team for LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted by ICASSP 2023 Special Session for Grand Challenges

  18. arXiv:2212.10794  [pdf, ps, other]

    eess.SP

    Joint Beamforming and PD Orientation Design for Mobile Visible Light Communications

    Authors: Shuai Ma, Jing Wang, Chun Du, Hang Li, Xiaodong Liu, Youlong Wu, Naofal Al-Dhahir, Shiyin Li

    Abstract: In this paper, we propose joint beamforming and photo-detector (PD) orientation (BO) optimization schemes for mobile visible light communication (VLC) with the orientation adjustable receiver (OAR). Since VLC is sensitive to line-of-sight propagation, we first establish the OAR model and the human body blockage model for mobile VLC user equipment (UE). To guarantee the quality of service (QoS) of…

    Submitted 21 December, 2022; originally announced December 2022.

  19. arXiv:2211.09496  [pdf, other]

    eess.AS cs.AI cs.HC cs.LG cs.SD

    EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

    Authors: Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a p…

    Submitted 16 February, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023

  20. arXiv:2211.02629  [pdf, other]

    eess.SP cs.AI cs.HC cs.LG q-bio.NC

    Multi-view Multi-label Fine-grained Emotion Decoding from Human Brain Activity

    Authors: Kaicheng Fu, Changde Du, Shengpei Wang, Huiguang He

    Abstract: Decoding emotional states from human brain activity plays an important role in brain-computer interfaces. Existing emotion decoding methods still have two main limitations: one is only decoding a single emotion category from a brain activity pattern and the decoded emotion categories are coarse-grained, which is inconsistent with the complex emotional expression of human; the other is ignoring the…

    Submitted 26 October, 2022; originally announced November 2022.

    Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems

  21. Constructions of Polyphase Golay Complementary Arrays

    Authors: Cheng Du, Yi Jiang

    Abstract: Golay complementary matrices (GCM) have recently drawn considerable attentions owing to its potential applications in omnidirectional precoding. In this paper we generalize the GCM to multi-dimensional Golay complementary arrays (GCA) and propose new constructions of GCA pairs and GCA quads. These constructions are facilitated by introducing a set of identities over a commutative ring. We prove th…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: 28 pages

    Journal ref: IEEE Transactions on Information Theory, vol. 70, no. 2, pp. 1422-1435, Feb. 2024

  22. arXiv:2204.00768  [pdf, other]

    eess.AS cs.SD

    VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

    Authors: Chenpeng Du, Yiwei Guo, Xie Chen, Kai Yu

    Abstract: The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(AM) that predicts acoustic feature from the input transcript and a vocoder that generates waveform according to the given acoustic feature. However, the acoustic feature in current TTS systems is typically mel-spectrogram, which is highly correlated along both time and frequency axes in a complicate…

    Submitted 30 June, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  23. arXiv:2202.07200  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Unsupervised word-level prosody tagging for controllable speech synthesis

    Authors: Yiwei Guo, Chenpeng Du, Kai Yu

    Abstract: Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to lack of word-level prosody tags. In this work, we propose a novel approach for unsupervised word-level prosody tagging with two stages, where we fi…

    Submitted 16 February, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: 5 pages, 6 figures, accepted to ICASSP2022

  24. arXiv:2109.12014  [pdf, other]

    cs.SD cs.LG eess.AS

    A data acquisition setup for data driven acoustic design

    Authors: Romana Rust, Achilleas Xydis, Kurt Heutschi, Nathanaël Perraudin, Gonzalo Casas, Chaoyu Du, Jürgen Strauss, Kurt Eggenschwiler, Fernando Perez-Cruz, Fabio Gramazio, Matthias Kohler

    Abstract: In this paper, we present a novel interdisciplinary approach to study the relationship between diffusive surface structures and their acoustic performance. Using computational design, surface structures are iteratively generated and 3D printed at 1:10 model scale. They originate from different fabrication typologies and are designed to have acoustic diffusion and absorption effects. An automated r…

    Submitted 24 September, 2021; originally announced September 2021.

    Journal ref: Building Acoustics. February 2021

  25. arXiv:2106.13849  [pdf, other]

    cs.CV eess.IV

    A CNN Segmentation-Based Approach to Object Detection and Tracking in Ultrasound Scans with Application to the Vagus Nerve Detection

    Authors: Abdullah F. Al-Battal, Yan Gong, Lu Xu, Timothy Morton, Chen Du, Yifeng Bu, Imanuel R Lerman, Radhika Madhavan, Truong Q. Nguyen

    Abstract: Ultrasound scanning is essential in several medical diagnostic and therapeutic applications. It is used to visualize and analyze anatomical features and structures that influence treatment plans. However, it is both labor intensive, and its effectiveness is operator dependent. Real-time accurate and robust automatic detection and tracking of anatomical structures while scanning would significantly…

    Submitted 25 June, 2021; originally announced June 2021.

    Comments: 7 pages, 4 figures, submitted to the IEEE EMBC 2021 conference

  26. arXiv:2105.13086  [pdf, other]

    cs.SD eess.AS

    Phone-Level Prosody Modelling with GMM-Based MDN for Diverse and Controllable Speech Synthesis

    Authors: Chenpeng Du, Kai Yu

    Abstract: Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling with phone-level prosody distribution has been investigated to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what can be achieved by humans. This is largely due to the use of uni-modal distribution, such as single Gaussi…

    Submitted 30 November, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

    Comments: Accepted to TASLP 2021. arXiv admin note: substantial text overlap with arXiv:2102.00851

  27. Data Augmentation for End-to-end Code-switching Speech Recognition

    Authors: Chenpeng Du, Hao Li, Yizhou Lu, Lan Wang, Yanmin Qian

    Abstract: Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited. In this paper, three novel approaches are proposed for code-switching data augmentation. Specifically, they are audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: Accepted by SLT2021

    Journal ref: 2021 IEEE Spoken Language Technology Workshop (SLT)

  28. arXiv:2009.06889  [pdf]

    eess.SY math.OC

    Distributed Model Predicted Control of Multi-agent Systems with Applications to Multi-vehicle Cooperation

    Authors: Yougang Bian, Changkun Du, Manjiang Hu, Haikuo Liu

    Abstract: This paper proposes a distributed model predicted control (DMPC) approach for consensus control of multi-agent systems (MASs) with linear agent dynamics and bounded control input constraints. Within the proposed DMPC framework, each agent exchanges assumed state trajectories with neighbors and solves a local open-loop optimization problem to obtain the optimal control input. In the optimization pr…

    Submitted 15 September, 2020; originally announced September 2020.

  29. arXiv:1903.11385  [pdf, ps, other]

    eess.SP cs.LG stat.ML

    Signal Demodulation with Machine Learning Methods for Physical Layer Visible Light Communications: Prototype Platform, Open Dataset and Algorithms

    Authors: Shuai Ma, Jiahui Dai, Songtao Lu, Hang Li, Han Zhang, Chun Du, Shiyin Li

    Abstract: In this paper, we investigate the design and implementation of machine learning (ML) based demodulation methods in the physical layer of visible light communication (VLC) systems. We build a flexible hardware prototype of an end-to-end VLC system, from which the received signals are collected as the real data. The dataset is available online, which contains eight types of modulated signals. Then,…

    Submitted 13 March, 2019; originally announced March 2019.

  30. arXiv:1808.02096  [pdf, other]

    eess.SP cs.CV cs.LG cs.MM

    Semi-supervised Deep Generative Modelling of Incomplete Multi-Modality Emotional Data

    Authors: Changde Du, Changying Du, Hao Wang, Jinpeng Li, Wei-Long Zheng, Bao-Liang Lu, Huiguang He

    Abstract: There are threefold challenges in emotion recognition. First, it is difficult to recognize human's emotional states only considering a single modality. Second, it is expensive to manually annotate the emotional data. Third, emotional data often suffers from missing modalities due to unforeseeable sensor malfunction or configuration issues. In this paper, we address all these problems under a novel…

    Submitted 27 July, 2018; originally announced August 2018.

    Comments: arXiv admin note: text overlap with arXiv:1704.07548, 2018 ACM Multimedia Conference (MM'18)

  31. arXiv:1806.10781  [pdf, other]

    cs.CV eess.IV

    Accurate and efficient video de-fencing using convolutional neural networks and temporal information

    Authors: Chen Du, Byeongkeun Kang, Zheng Xu, Ji Dai, Truong Nguyen

    Abstract: De-fencing is to eliminate the captured fence on an image or a video, providing a clear view of the scene. It has been applied for many purposes including assisting photographers and improving the performance of computer vision algorithms such as object detection and recognition. However, the state-of-the-art de-fencing methods have limited performance caused by the difficulty of fence segmentatio…

    Submitted 28 June, 2018; originally announced June 2018.

  32. arXiv:1711.04646  [pdf]

    eess.SP physics.optics

    Orbital-angular-momentum mode-group multiplexed transmission over a graded-index ring-core fiber based on receive diversity and maximal ratio combining

    Authors: Junwei Zhang, Guoxuan Zhu, Liu Jie, Xiong Wu, Jianbo Zhu, Cheng Du, Wenyong Luo, Siyuan Yu

    Abstract: An orbital-angular-momentum (OAM) mode-group multiplexing (MGM) scheme based on a graded-index ring-core fiber (GIRCF) is proposed, in which a single-input two-output (or receive diversity) architecture is designed for each MG channel and simple digital signal processing (DSP) is utilized to adaptively resist the mode partition noise resulting from random intra-group mode crosstalk. There is no ne…

    Submitted 9 November, 2017; originally announced November 2017.

    Comments: 13 pages, 6 figures