Showing 1–29 of 29 results for author: Zeng, C

  1. arXiv:2403.05989  [pdf, other]

    cs.SD eess.AS

    HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

    Authors: Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen

    Abstract: Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, speaking style and timbre inconsistency, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train i…

    Submitted 9 March, 2024; originally announced March 2024.

  2. arXiv:2309.13166  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Invisible Watermarking for Audio Generation Diffusion Models

    Authors: Xirong Cao, Xiang Li, Divyesh Jadav, Yanzhao Wu, Zhehui Chen, Chen Zeng, Wenqi Wei

    Abstract: Diffusion models have gained prominence in the image domain for their capabilities in data generation and transformation, achieving state-of-the-art performance in various tasks in both image and audio domains. In the rapidly evolving field of audio-based machine learning, safeguarding model integrity and establishing data copyright are of paramount importance. This paper presents the first waterm…

    Submitted 31 October, 2023; v1 submitted 22 September, 2023; originally announced September 2023.

    Comments: This is an invited paper for IEEE TPS, part of the IEEE CIC/CogMI/TPS 2023 conference

  3. arXiv:2309.12672  [pdf, other]

    cs.SD eess.AS

    CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers

    Authors: Xintong Wang, Chang Zeng, Jun Chen, Chunhui Wang

    Abstract: It is challenging to build a multi-singer high-fidelity singing voice synthesis system with cross-lingual ability by only using monolingual singers in the training stage. In this paper, we propose CrossSinger, which is a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we utilize the International Phonetic Alphabet to unify the representation for all languages of the traini…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: Accepted by ASRU2023

  4. arXiv:2307.02751  [pdf, ps, other]

    cs.SD cs.CR eess.AS

    DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

    Authors: Zhifeng Wang, Chunyan Zeng, Surong Duan, Hongjie Ouyang, Hongmin Xu

    Abstract: Speaker recognition is a biometric modality that utilizes the speaker's speech segments to recognize the identity, determining whether the test speaker belongs to one of the enrolled speakers. In order to improve the robustness of the i-vector framework on cross-channel conditions and explore a novel method for applying deep learning to speaker recognition, the Stacked Auto-encoders are used to g…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: 12 pages, 3 figures

  5. arXiv:2305.10940  [pdf, other]

    eess.AS

    Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms

    Authors: Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi

    Abstract: The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge. However, a new mismatch scenario in which fake audio may be generated from real audio with unseen genres has not been studied thoroughly. To this end, we first use five different vocoders to create a new dataset called CN-Spoof based on the CN-Celeb1…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023

  6. arXiv:2303.13072  [pdf, other]

    cs.SD cs.CL eess.AS

    Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition

    Authors: Haoyu Tang, Zhaoyi Liu, Chang Zeng, Xinfeng Li

    Abstract: Transformer-based models have recently made significant achievements in the application of end-to-end (E2E) automatic speech recognition (ASR). It is possible to deploy the E2E ASR system on smart devices with the help of Transformer-based models. However, these models still have the disadvantage of requiring a large number of model parameters. To overcome the drawback of universal Transformer models…

    Submitted 5 April, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

  7. arXiv:2302.11254  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS eess.IV

    Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

    Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-mod…

    Submitted 22 February, 2023; originally announced February 2023.

  8. arXiv:2212.02084  [pdf, other]

    cs.SD eess.AS

    End-to-end Recording Device Identification Based on Deep Representation Learning

    Authors: Chunyan Zeng, Dongliang Zhu, Zhifeng Wang, Minghu Wu, Wei Xiong, Nan Zhao

    Abstract: Deep learning techniques have achieved specific results in recording device source identification. The recording device source features include spatial information and certain temporal information. However, most recording device source identification methods based on deep learning only use spatial representation learning from recording device source features, which cannot make full use of recordin…

    Submitted 5 December, 2022; originally announced December 2022.

    Comments: 20 pages, 5 figures, recording device identification

  9. arXiv:2211.05963  [pdf, other]

    cs.CV eess.IV

    JSRNN: Joint Sampling and Reconstruction Neural Networks for High Quality Image Compressed Sensing

    Authors: Chunyan Zeng, Jiaxiang Ye, Zhifeng Wang, Nan Zhao, Minghu Wu

    Abstract: Most Deep Learning (DL) based Compressed Sensing (DCS) algorithms adopt a single neural network for signal reconstruction, and fail to jointly consider the influences of the sampling operation for reconstruction. In this paper, we propose a unified framework, which jointly considers the sampling and reconstruction process for image compressive sensing based on well-designed cascade neural networks.…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: 9 pages, 3 figures

  10. arXiv:2210.14666  [pdf, other]

    eess.AS cs.SD

    Xiaoicesing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

    Authors: Chunhui Wang, Chang Zeng, Xing He

    Abstract: XiaoiceSing is a singing voice synthesis (SVS) system that aims at generating 48kHz singing voices. However, the mel-spectrogram generated by it is over-smoothed in middle- and high-frequency areas due to no special design for modeling the details of these parts. In this paper, we propose XiaoiceSing2, which can generate the details of middle- and high-frequency parts to better construct the full…

    Submitted 28 October, 2022; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  11. arXiv:2210.12740  [pdf, other]

    eess.AS cs.SD

    HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

    Authors: Chunhui Wang, Chang Zeng, Jun Chen, Xing He

    Abstract: Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate high-fidelity (e.g. 48kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize 48kHz high-quality singing voices in real time. Specifically, it consists of an Extended WaveNet serving as a generator, a mult…

    Submitted 17 September, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

  12. arXiv:2210.10506  [pdf, other]

    cs.SD eess.AS

    Audio Tampering Detection Based on Shallow and Deep Feature Representation Learning

    Authors: Zhifeng Wang, Yao Yang, Chunyan Zeng, Shuai Kong, Shixiong Feng, Nan Zhao

    Abstract: Digital audio tampering detection can be used to verify the authenticity of digital audio. However, most current methods use standard electric network frequency (ENF) databases for visual comparison analysis of ENF continuity of digital audio or perform feature extraction for classification by machine learning methods. ENF databases are usually tricky to obtain, visual methods have weak feature…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Audio tampering detection, 21 pages, 4 figures

  13. arXiv:2210.05254  [pdf, other]

    cs.SD cs.AI eess.AS

    Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

    Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

    Abstract: The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding featur…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 7 pages, 1 figure, Accepted by Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

  14. arXiv:2209.13761  [pdf, other]

    eess.IV cs.CV cs.MM

    Image Compressed Sensing with Multi-scale Dilated Convolutional Neural Network

    Authors: Zhifeng Wang, Zhenghui Wang, Chunyan Zeng, Yan Yu, Xiangkui Wan

    Abstract: Deep Learning (DL) based Compressed Sensing (CS) has been applied for better performance of image reconstruction than traditional CS methods. However, most existing DL methods utilize the block-by-block measurement and each measurement block is restored separately, which introduces harmful blocking effects for reconstruction. Furthermore, the neuronal receptive fields of those methods are designed…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: 28 pages, 8 figures, MsDCNN for CS

  15. arXiv:2209.00485  [pdf, other]

    eess.AS cs.SD

    Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and ba…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: Submitted to TASLP

  16. Spoofing-Aware Attention based ASV Back-end with Multiple Enrollment Utterances and a Sampling Strategy for the SASV Challenge 2022

    Authors: Chang Zeng, Lin Zhang, Meng Liu, Junichi Yamagishi

    Abstract: Current state-of-the-art automatic speaker verification (ASV) systems are vulnerable to presentation attacks, and several countermeasures (CMs), which distinguish bona fide trials from spoofing ones, have been explored to protect ASV. However, ASV systems and CMs are generally developed and optimized independently without considering their inter-relationship. In this paper, we propose a new spoofi…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: Accepted by InterSpeech2022

  17. arXiv:2208.12753  [pdf, other]

    cs.SD cs.AI eess.AS

    Spatio-Temporal Representation Learning Enhanced Source Cell-phone Recognition from Speech Recordings

    Authors: Chunyan Zeng, Shixiong Feng, Zhifeng Wang, Xiangkui Wan, Yunfan Chen, Nan Zhao

    Abstract: The existing source cell-phone recognition method lacks the long-term feature characterization of the source device, resulting in inaccurate representation of the source cell-phone related features which leads to insufficient recognition accuracy. In this paper, we propose a source cell-phone recognition method based on spatio-temporal representation learning, which includes two main parts: extrac…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: 29 pages, 4 figures

  18. arXiv:2208.11920  [pdf]

    cs.SD eess.AS

    Digital Audio Tampering Detection Based on ENF Spatio-temporal Features Representation Learning

    Authors: Chunyan Zeng, Shuai Kong, Zhifeng Wang, Xiangkui Wan, Yunfan Chen

    Abstract: Most digital audio tampering detection methods based on electrical network frequency (ENF) only utilize the static spatial information of ENF, ignoring the variation of ENF in time series, which limits the ability of ENF feature representation and reduces the accuracy of tampering detection. This paper proposes a new method for digital audio tampering detection based on ENF spatio-temporal features…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: 19 pages, 6 figures

  19. arXiv:2208.08315  [pdf, other]

    eess.IV cs.CV

    Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation

    Authors: Chengxi Zeng, Xinyu Yang, Majid Mirmehdi, Alberto M Gambaruto, Tilo Burghardt

    Abstract: We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, a…

    Submitted 22 August, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

    Comments: Accepted by International Conference on Machine Vision 2022

  20. arXiv:2208.02912  [pdf]

    eess.IV cs.CV

    Unsupervised Tissue Segmentation via Deep Constrained Gaussian Network

    Authors: Yang Nan, Peng Tang, Guyue Zhang, Caihong Zeng, Zhihong Liu, Zhifan Gao, Heye Zhang, Guang Yang

    Abstract: Tissue segmentation is the mainstay of pathological examination, whereas the manual delineation is unduly burdensome. To assist this time-consuming and subjective manual step, researchers have devised methods to automatically segment structures in pathological images. Recently, automated machine and deep learning based methods dominate tissue segmentation research studies. However, most machine an…

    Submitted 4 August, 2022; originally announced August 2022.

    Comments: 13 pages, 8 figures, accepted by IEEE TMI

  21. arXiv:2203.05847  [pdf]

    eess.IV cs.AI cs.CV

    Automatic Fine-grained Glomerular Lesion Recognition in Kidney Pathology

    Authors: Yang Nan, Fengyi Li, Peng Tang, Guyue Zhang, Caihong Zeng, Guotong Xie, Zhihong Liu, Guang Yang

    Abstract: Recognition of glomeruli lesions is the key for diagnosis and treatment planning in kidney pathology; however, the coexisting glomerular structures such as mesangial regions exacerbate the difficulties of this task. In this paper, we introduce a scheme to recognize fine-grained glomeruli lesions from whole slide images. First, a focal instance structural similarity loss is proposed to drive the mo…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: 33 pages, 6 figures, accepted by the Pattern Recognition journal

  22. arXiv:2112.00485  [pdf, other]

    cs.CV eess.IV

    Learning Transformer Features for Image Quality Assessment

    Authors: Chao Zeng, Sam Kwong

    Abstract: Objective image quality evaluation is a challenging task, which aims to measure the quality of a given image automatically. According to the availability of the reference images, there are Full-Reference and No-Reference IQA tasks, respectively. Most deep learning approaches use regression from deep features extracted by Convolutional Neural Networks. For the FR task, another option is conducting…

    Submitted 23 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

  23. arXiv:2104.08510  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Exploring Deep Learning for Joint Audio-Visual Lip Biometrics

    Authors: Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a modera…

    Submitted 17 April, 2021; originally announced April 2021.

  24. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propose a novel attention back-end model, which can be used for both text-independent (TI) and text-dependent (TD) speaker verification, and employ s…

    Submitted 5 October, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

  25. Signal Detection in Distributed MIMO Radar with Non-Orthogonal Waveforms and Sync Errors

    Authors: Hongbin Li, Fangzhou Wang, Cengcang Zeng, Mark A. Govoni

    Abstract: Although routinely utilized in literature, orthogonal waveforms may lose orthogonality in distributed multi-input multi-output (MIMO) radar with spatially separated transmit (TX) and receive (RX) antennas, as the waveforms may experience distinct delays and Doppler frequency offsets unique to different TX-RX propagation paths. In such cases, the output of each waveform-specific matched filter (MF)…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: 14 pages, 9 figures

  26. VerSe: A Vertebrae Labelling and Segmentation Benchmark for Multi-detector CT Images

    Authors: Anjany Sekuboyina, Malek E. Husseini, Amirhossein Bayat, Maximilian Löffler, Hans Liebl, Hongwei Li, Giles Tetteh, Jan Kukačka, Christian Payer, Darko Štern, Martin Urschler, Maodong Chen, Dalong Cheng, Nikolas Lessmann, Yujin Hu, Tianfu Wang, Dong Yang, Daguang Xu, Felix Ambellan, Tamaz Amiranashvili, Moritz Ehlke, Hans Lamecker, Sebastian Lehnert, Marilia Lirio, Nicolás Pérez de Olaguer, et al. (44 additional authors not shown)

    Abstract: Vertebral labelling and segmentation are two fundamental tasks in an automated spine processing pipeline. Reliable and accurate processing of spine images is expected to benefit clinical decision-support systems for diagnosis, surgery planning, and population-based analysis on spine and bone health. However, designing automated algorithms for spine processing is challenging predominantly due to co…

    Submitted 5 April, 2022; v1 submitted 24 January, 2020; originally announced January 2020.

    Comments: Challenge report for the VerSe 2019 and 2020. Published in Medical Image Analysis (DOI: https://doi.org/10.1016/j.media.2021.102166)

    Journal ref: Medical Image Analysis, Volume 73, October 2021, 102166

  27. An Urban Water Extraction Method Combining Deep Learning and Google Earth Engine

    Authors: Yudie Wang, Zhiwei Li, Chao Zeng, Gui-Song Xia, Huanfeng Shen

    Abstract: Urban water is important for the urban ecosystem. Accurate and efficient detection of urban water with remote sensing data is of great significance for urban management and planning. In this paper, we proposed a new method to combine Google Earth Engine (GEE) with multiscale convolutional neural network (MSCNN) to extract urban water from Landsat images, which is summarized as offline training and…

    Submitted 19 May, 2024; v1 submitted 23 December, 2019; originally announced December 2019.

    Comments: This manuscript has been accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 769-782, 2020

    Journal ref: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 769-782, 2020

  28. arXiv:1909.09316  [pdf]

    physics.ao-ph eess.IV physics.data-an

    Spatially Continuous and High-resolution Land Surface Temperature: A Review of Reconstruction and Spatiotemporal Fusion Techniques

    Authors: Penghai Wu, Zhixiang Yin, Chao Zeng, Sibo Duan, Frank-Michael Gottsche, Xiaoshaung Ma, Xinghua Li, Hui Yang, Huanfeng Shen

    Abstract: Remotely sensed, spatially continuous and high spatiotemporal resolution (hereafter referred to as high resolution) land surface temperature (LST) is a key parameter for studying the thermal environment and has important applications in many fields. However, difficult atmospheric conditions, sensor malfunctioning and scanning gaps between orbits frequently introduce spatial discontinuities into sa…

    Submitted 20 September, 2019; originally announced September 2019.

    Comments: 41 pages, 7 figures, 2 tables

  29. An Implementation of List Successive Cancellation Decoder with Large List Size for Polar Codes

    Authors: ChenYang Xia, YouZhe Fan, Ji Chen, Chi-ying Tsui, ChongYang Zeng, Jie Jin, Bin Li

    Abstract: Polar codes are the first class of forward error correction (FEC) codes with a provably capacity-achieving capability. Using list successive cancellation decoding (LSCD) with a large list size, the error correction performance of polar codes exceeds other well-known FEC codes. However, the hardware complexity of LSCD rapidly increases with the list size, which incurs high usage of the resources on…

    Submitted 8 May, 2018; originally announced May 2018.

    Comments: 4 pages, 4 figures, 4 tables, Published in 27th International Conference on Field Programmable Logic and Applications (FPL), 2017