Showing 1–50 of 61 results for author: Song, K

  1. arXiv:2406.05763  [pdf, other]

    eess.AS

    WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

    Authors: Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

    Abstract: With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio…

    Submitted 19 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  2. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  3. arXiv:2401.10278  [pdf, other]

    eess.SP cs.AI cs.LG cs.MM q-bio.NC

    EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model

    Authors: Yuqi Chen, Kan Ren, Kaitao Song, Yansen Wang, Yifan Wang, Dongsheng Li, Lili Qiu

    Abstract: Self-supervised learning has emerged as a highly effective approach in the fields of natural language processing and computer vision. It is also applicable to brain signals such as electroencephalography (EEG) data, given the abundance of available unlabeled data that exist in a wide spectrum of real-world medical applications ranging from seizure detection to wave analysis. The existing works lev…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: A preprint version of an ongoing work

  4. arXiv:2401.01721  [pdf, other]

    cs.IT eess.SP

    Limited Feedback on Measurements: Sharing a Codebook or a Generative Model?

    Authors: Nurettin Turan, Benedikt Fesl, Michael Joham, Zhengxiang Ma, Anthony C. K. Soong, Baoling Sheen, Weimin Xiao, Wolfgang Utschick

    Abstract: Discrete Fourier transform (DFT) codebook-based solutions are well-established for limited feedback schemes in frequency division duplex (FDD) systems. In recent years, data-aided solutions have been shown to achieve higher performance, enabled by the adaptivity of the feedback scheme to the propagation environment of the base station (BS) cell. In particular, a versatile limited feedback scheme u…

    Submitted 3 January, 2024; originally announced January 2024.

  5. arXiv:2312.14511  [pdf]

    cs.RO eess.SY

    3D Programming of Patterned Heterogeneous Interface for 4D Smart Robotics

    Authors: Kewei Song, Chunfeng Xiong, Ze Zhang, Kunlin Wu, Weiyang Wan, Yifan Wang, Shinjiro Umezu, Hirotaka Sato

    Abstract: Shape memory structures are playing an important role in many cutting-edge intelligent fields. However, the existing technologies can only realize 4D printing of a single polymer or metal, which limits practical applications. Here, we report a construction strategy for TSMP/M heterointerface, which uses Pd2+-containing shape memory polymer (AP-SMR) to induce electroless plating reaction and relies…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: 37 Pages, 11 Figures

  6. arXiv:2310.16869  [pdf]

    eess.IV physics.optics

    Single-pixel imaging based on deep learning

    Authors: Kai Song, Yaoxing Bian, Ku Wu, Hongrui Liu, Shuangping Han, Jiaming Li, Jiazhao Tian, Chengbin Qin, Jianyong Hu, Liantuan Xiao

    Abstract: Single-pixel imaging can collect images at the wavelengths outside the reach of conventional focal plane array detectors. However, the limited image quality and lengthy computational times for iterative reconstruction still impede the practical application of single-pixel imaging. Recently, deep learning has been introduced into single-pixel imaging, which has attracted a lot of attention due to i…

    Submitted 16 November, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

  7. arXiv:2310.11954  [pdf, other]

    cs.CL cs.MM eess.AS

    MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

    Authors: Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

    Abstract: AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data…

    Submitted 25 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  8. arXiv:2309.02285  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS 2: Describing and Generating Voices with Text Prompt

    Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

    Abstract: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text…

    Submitted 11 October, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Demo page: https://speechresearch.github.io/prompttts2

  9. arXiv:2307.04630  [pdf, other]

    cs.SD eess.AS

    The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

    Authors: Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

    Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Spec…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

  10. ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

    Authors: Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee

    Abstract: While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a l…

    Submitted 7 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, Proceedings of Interspeech 2023

  11. arXiv:2306.05629  [pdf, other]

    cs.IT eess.SY

    R-PMAC: A Robust Preamble Based MAC Mechanism Applied in Industrial Internet of Things

    Authors: Kai Song, Biqian Feng, Yongpeng Wu, Zhen Gao, Wenjun Zhang

    Abstract: This paper proposes a novel media access control (MAC) mechanism, called the robust preamble-based MAC mechanism (R-PMAC), which can be applied to power line communication (PLC) networks in the context of the Industrial Internet of Things (IIoT). Compared with other MAC mechanisms such as P-MAC and the MAC layer of IEEE1901.1, R-PMAC has higher networking speed. Besides, it supports whitelist auth…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: This paper has been accepted by IEEE Internet of Things Journal

  12. arXiv:2306.02682  [pdf, other]

    cs.CL eess.AS

    End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

    Authors: Yukang Liang, Kaitao Song, Shaoguang Mao, Huiqiang Jiang, Luna Qiu, Yuqing Yang, Dongsheng Li, Linli Xu, Lili Qiu

    Abstract: Pronunciation assessment is a major challenge in the computer-aided pronunciation training system, especially at the word (phoneme)-level. To obtain word (phoneme)-level scores, current methods usually rely on aligning components to obtain acoustic features of each word (phoneme), which limits the performance of assessment to the accuracy of alignments. Therefore, to address this problem, we propo…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted by InterSpeech 2023

  13. arXiv:2305.17732  [pdf, other]

    cs.SD eess.AS

    StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation

    Authors: Kun Song, Yi Ren, Yi Lei, Chunfeng Wang, Kun Wei, Lei Xie, Xiang Yin, Zejun Ma

    Abstract: Direct speech-to-speech translation (S2ST) has gradually become popular as it has many advantages compared with cascade S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the speech style transfer from a source language to a target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in more p…

    Submitted 25 July, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  14. arXiv:2303.08027  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition

    Authors: Jinchao Li, Xixin Wu, Kaitao Song, Dongsheng Li, Xunying Liu, Helen Meng

    Abstract: As a common way of emotion signaling via non-linguistic vocalizations, vocal burst (VB) plays an important role in daily social interaction. Understanding and modeling human vocal bursts are indispensable for developing robust and general artificial intelligence. Exploring computational approaches for understanding vocal bursts is attracting increasing research attention. In this work, we propose…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: 5 pages, 3 figures, 5 tables

  15. arXiv:2303.08019  [pdf, other]

    eess.AS cs.LG cs.SD q-bio.QM

    Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection

    Authors: Jinchao Li, Kaitao Song, Junan Li, Bo Zheng, Dongsheng Li, Xixin Wu, Xunying Liu, Helen Meng

    Abstract: With the global population aging rapidly, Alzheimer's disease (AD) is particularly prominent in older adults, which has an insidious onset and leads to a gradual, irreversible deterioration in cognitive domains (memory, communication, etc.). Speech-based AD detection opens up the possibility of widespread screening and timely disease intervention. Recent advances in pre-trained models motivate AD…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: 5 pages, 3 figures, 3 tables

  16. arXiv:2212.01039  [pdf, other]

    cs.CL cs.LG eess.AS

    SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition

    Authors: Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu

    Abstract: Error correction in automatic speech recognition (ASR) aims to correct those incorrect words in sentences generated by ASR models. Since recent ASR models usually have low word error rate (WER), to avoid affecting originally correct tokens, error correction models should only modify incorrect words, and therefore detecting incorrect words is important for error correction. Previous works on error…

    Submitted 20 December, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: AAAI 2023

  17. arXiv:2211.10568  [pdf, other]

    eess.AS cs.SD

    Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

    Authors: Xinfa Zhu, Yi Lei, Kun Song, Yongmao Zhang, Tao Li, Lei Xie

    Abstract: This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features.…

    Submitted 14 March, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP2023

  18. arXiv:2211.01087  [pdf, other]

    cs.SD eess.AS

    DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

    Authors: Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, Jinfeng Bai

    Abstract: Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages of generating raw waveform conditioned on mel-spectrogram with fast inference speed and lightweight networks. Whereas, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages,…

    Submitted 28 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  19. arXiv:2210.17349  [pdf, other]

    cs.SD eess.AS

    Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

    Authors: Kun Song, Jian Cong, Xinsheng Wang, Yongmao Zhang, Lei Xie, Ning Jiang, Haiying Wu

    Abstract: In current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder, once trained, which is robust to imperfect mel-spectrogram predicted from the acoustic model. To this end, we propose Robust MelGAN vocoder by solving the original multi-band MelGAN's metallic sound problem and increasing its generalization ability. Specifically, we introduce a fine-grained n…

    Submitted 2 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted by ISCSLP 2022

  20. 3D Matting: A Benchmark Study on Soft Segmentation Method for Pulmonary Nodules Applied in Computed Tomography

    Authors: Lin Wang, Xiufen Ye, Donghao Zhang, Wanji He, Lie Ju, Yi Luo, Huan Luo, Xin Wang, Wei Feng, Kaimin Song, Xin Zhao, Zongyuan Ge

    Abstract: Usually, lesions are not isolated but are associated with the surrounding tissues. For example, the growth of a tumour can depend on or infiltrate into the surrounding tissues. Due to the pathological nature of the lesions, it is challenging to distinguish their boundaries in medical imaging. However, these uncertain regions may contain diagnostic information. Therefore, the simple binarization of…

    Submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted by Computers in Biology and Medicine. arXiv admin note: substantial text overlap with arXiv:2209.07843

  21. arXiv:2209.10887  [pdf, other]

    cs.SD cs.CL eess.AS

    A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS

    Authors: Haohan Guo, Fenglong Xie, Frank K. Soong, Xixin Wu, Helen Meng

    Abstract: We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks,…

    Submitted 22 September, 2022; originally announced September 2022.

  22. arXiv:2209.07843  [pdf, other]

    eess.IV cs.CV

    3D Matting: A Soft Segmentation Method Applied in Computed Tomography

    Authors: Lin Wang, Xiufen Ye, Donghao Zhang, Wanji He, Lie Ju, Xin Wang, Wei Feng, Kaimin Song, Xin Zhao, Zongyuan Ge

    Abstract: Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis. Semantic ambiguity is a typical feature of many medical image labels. It can be caused by many factors, such as the imaging properties, pathological anatomy, and the weak representation of the binary masks, which brings challenges to accurate 3D segmentation. In…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: 12 pages, 7 figures

  23. arXiv:2209.06484  [pdf, other]

    cs.SD cs.CL eess.AS

    ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

    Authors: Liumeng Xue, Frank K. Soong, Shaofei Zhang, Lei Xie

    Abstract: Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in trai…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing

  24. arXiv:2208.05122  [pdf, other]

    eess.AS

    Improving Hypernasality Estimation with Automatic Speech Recognition in Cleft Palate Speech

    Authors: Kaitao Song, Teng Wan, Bixia Wang, Huiqiang Jiang, Luna Qiu, Jiahang Xu, Liping Jiang, Qun Lou, Yuqing Yang, Dongsheng Li, Xudong Wang, Lili Qiu

    Abstract: Hypernasality is an abnormal resonance in human speech production, especially in patients with craniofacial anomalies such as cleft palate. In clinical application, hypernasality estimation is crucial in cleft palate diagnosis, as its results determine the subsequent surgery and additional speech therapy. Therefore, designing an automatic hypernasality assessment method will facilitate speech-lang…

    Submitted 9 August, 2022; originally announced August 2022.

    Comments: Accepted by InterSpeech 2022

  25. arXiv:2207.02399  [pdf]

    eess.IV cs.CV

    Learning Apparent Diffusion Coefficient Maps from Accelerated Radial k-Space Diffusion-Weighted MRI in Mice using a Deep CNN-Transformer Model

    Authors: Yuemeng Li, Miguel Romanello Joaquim, Stephen Pickup, Hee Kwon Song, Rong Zhou, Yong Fan

    Abstract: Purpose: To accelerate radially sampled diffusion weighted spin-echo (Rad-DW-SE) acquisition method for generating high quality apparent diffusion coefficient (ADC) maps. Methods: A deep learning method was developed to generate accurate ADC maps from accelerated DWI data acquired with the Rad-DW-SE method. The deep learning method integrates convolutional neural networks (CNNs) with vision transf…

    Submitted 1 August, 2023; v1 submitted 5 July, 2022; originally announced July 2022.

    Comments: Accepted by Magnetic Resonance in Medicine

    Journal ref: Magn Reson Med 2023

  26. arXiv:2206.00208  [pdf, other]

    cs.SD eess.AS

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Authors: Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Lei Xie, Bing Yang, Xiong Zhang, Dan Su

    Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been conducted towards this task, seldom work has been performed for low computational resource scenarios due to the challenges raised by the requirement of the lightweight model and less computational complexity. In this paper, a tiny…

    Submitted 2 November, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

    Comments: Accepted by ISCSLP 2022

  27. arXiv:2204.09679  [pdf, other]

    cs.CV cs.LG eess.IV

    FS-NCSR: Increasing Diversity of the Super-Resolution Space via Frequency Separation and Noise-Conditioned Normalizing Flow

    Authors: Ki-Ung Song, Dongseok Shim, Kang-wook Kim, Jae-young Lee, Younggeun Kim

    Abstract: Super-resolution suffers from an innate ill-posed problem that a single low-resolution (LR) image can be from multiple high-resolution (HR) images. Recent studies on the flow-based algorithm solve this ill-posedness by learning the super-resolution space and predicting diverse HR outputs. Unfortunately, the diversity of the super-resolution outputs is still unsatisfactory, and the outputs from the…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: CVPRW 2022, First three authors are equally contributed

  28. arXiv:2204.03703  [pdf, other]

    eess.IV cs.LG eess.SP

    Physics-assisted Generative Adversarial Network for X-Ray Tomography

    Authors: Zhen Guo, Jung Ki Song, George Barbastathis, Michael E. Glinsky, Courtenay T. Vaughan, Kurt W. Larson, Bradley K. Alpert, Zachary H. Levine

    Abstract: X-ray tomography is capable of imaging the interior of objects in three dimensions non-invasively, with applications in biomedical imaging, materials science, electronic inspection, and other fields. The reconstruction process can be an ill-conditioned inverse problem, requiring regularization to obtain satisfactory results. Recently, deep learning has been adopted for tomographic reconstruction.…

    Submitted 3 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: arXiv admin note: text overlap with arXiv:2111.08011

  29. arXiv:2203.17190  [pdf, other]

    eess.AS cs.CL

    Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

    Authors: Guangyan Zhang, Kaitao Song, Xu Tan, Daxin Tan, Yuzi Yan, Yanqing Liu, Gang Wang, Wei Zhou, Tao Qin, Tan Lee, Sheng Zhao

    Abstract: Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, the works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lack the ability t…

    Submitted 19 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted by interspeech 2022

  30. arXiv:2203.11404  [pdf, other]

    eess.SP cs.IT

    Enhanced Preamble Based MAC Mechanism for IIoT-oriented PLC Network

    Authors: Kai Song, Biqian Feng, Yongpeng Wu, Wenjun Zhang

    Abstract: In this paper, we propose an enhanced preamble based media access control mechanism (E-PMAC), which can be applied in power line communication (PLC) network for Industrial Internet of Things (IIoT). We introduce detailed technologies used in E-PMAC, including delay calibration mechanism, preamble design, and slot allocation algorithm. With these technologies, E-PMAC is more robust than existing pr…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: 7 pages, 12 figures, to appear in The 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring)

  31. arXiv:2201.09472  [pdf, other]

    cs.SD eess.AS

    Disentangling Style and Speaker Attributes for TTS Style Transfer

    Authors: Xiaochun An, Frank K. Soong, Lei Xie

    Abstract: End-to-end neural TTS has shown improved performance in speech style transfer. However, the improvement is still limited by the available training data in both target styles and speakers. Additionally, degenerated performance is observed when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach…

    Submitted 24 January, 2022; originally announced January 2022.

  32. arXiv:2112.11661  [pdf]

    cs.RO eess.SY physics.app-ph physics.chem-ph

    New metal-plastic hybrid additive manufacturing strategy: Fabrication of arbitrary metal-patterns on external and even internal surfaces of 3D plastic structures

    Authors: Kewei Song, Yue Cui, Tiannan Tao, Xiangyi Meng, Michinari Sone, Masahiro Yoshino, Shinjiro Umezu, Hirotaka Sato

    Abstract: Constructing precise micro-nano metal patterns on complex three-dimensional (3D) plastic parts allows the fabrication of functional devices for advanced applications. However, this patterning is currently expensive and requires complex processes with long manufacturing lead time. The present work demonstrates a process for the fabrication of micro-nano 3D metal-plastic composite structures with ar…

    Submitted 21 December, 2021; originally announced December 2021.

  33. arXiv:2111.08011  [pdf, other]

    eess.IV cs.LG eess.SP

    Advantage of Machine Learning over Maximum Likelihood in Limited-Angle Low-Photon X-Ray Tomography

    Authors: Zhen Guo, Jung Ki Song, George Barbastathis, Michael E. Glinsky, Courtenay T. Vaughan, Kurt W. Larson, Bradley K. Alpert, Zachary H. Levine

    Abstract: Limited-angle X-ray tomography reconstruction is an ill-conditioned inverse problem in general. Especially when the projection angles are limited and the measurements are taken in a photon-limited condition, reconstructions from classical algorithms such as filtered backprojection may lose fidelity and acquire artifacts due to the missing-cone problem. To obtain satisfactory reconstruction results…

    Submitted 18 December, 2021; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: To appear, Machine Learning for Scientific Imaging 2022 Conference, at IS&T Electronic Imaging 2022. 6 pages, 4 figures

  34. arXiv:2110.11827  [pdf, other]

    cs.IT eess.SP

    Uniquely Decodable Multi-Amplitude Sequence for Grant-Free Multiple-Access Adder Channels

    Authors: Qi-Yue Yu, Ke-Xun Song

    Abstract: Grant-free multiple-access (GFMA) is a valuable research topic, since it can support multiuser transmission with low latency. This paper constructs novel uniquely-decodable multi-amplitude sequence (UDAS) sets for GFMA systems, which can provide high spectrum efficiency (SE) with low-complexity active user detection (AUD) algorithm. First of all, we propose an UDAS-based multi-dimensional bit inte…

    Submitted 8 April, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

    Comments: 29 pages, 7 figures

  35. arXiv:2110.09698  [pdf, other]

    cs.SD cs.CL eess.AS

    Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

    Authors: Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    Abstract: End-to-end TTS requires a large amount of speech/text paired data to cover all necessary knowledge, particularly how to pronounce different words in diverse contexts, so that a neural model may learn such knowledge accordingly. But in real applications, such high demand of training data is hard to be satisfied and additional knowledge often needs to be injected manually. For example, to capture pr…

    Submitted 24 June, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures; accepted by Interspeech 2022

  36. arXiv:2110.03857  [pdf, other]

    eess.AS cs.CL cs.SD

    A study on the efficacy of model pre-training in developing neural text-to-speech system

    Authors: Guangyan Zhang, Yichong Leng, Daxin Tan, Ying Qin, Kaitao Song, Xu Tan, Sheng Zhao, Tan Lee

    Abstract: In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand bet…

    Submitted 7 October, 2021; originally announced October 2021.

  37. arXiv:2108.01821  [pdf, other]

    eess.IV cs.CV

    Unsupervised Domain Adaptation for Retinal Vessel Segmentation with Adversarial Learning and Transfer Normalization

    Authors: Wei Feng, Lie Ju, Lin Wang, Kaimin Song, Xin Wang, Xin Zhao, Qingyi Tao, Zongyuan Ge

    Abstract: Retinal vessel segmentation plays a key role in computer-aided screening, diagnosis, and treatment of various cardiovascular and ophthalmic diseases. Recently, deep learning-based retinal vessel segmentation algorithms have achieved remarkable performance. However, due to the domain shift problem, the performance of these algorithms often degrades when they are applied to new data that is differen…

    Submitted 3 August, 2021; originally announced August 2021.

  38. arXiv:2107.09561  [pdf, other]

    eess.SP cs.IT

    Explicit Calibration of mmWave Phased Arrays with Phase Dependent Errors

    Authors: Joyson Sebastian, Pranav Dayal, Kee-Bong Song, Walid AliAhmad

    Abstract: We consider an error model for phased array with gain errors and phase errors, with errors dependent on the phase applied and the antenna index. Under this model, we propose an algorithm for measuring the errors by selectively turning on the antennas at specific phases and measuring the transmitted power. In our algorithm, the antennas are turned on individually and then pairwise for the measureme…

    Submitted 16 July, 2021; originally announced July 2021.

  39. arXiv:2107.01875  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG eess.AS

    DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

    Authors: Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L. Zhang, Tao Qin, Wei-Qiang Zhang, Tie-Yan Liu

    Abstract: Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap d…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted by ACL 2021 main conference

  40. arXiv:2106.10003  [pdf, other]

    cs.SD eess.AS

    Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

    Authors: Xiaochun An, Frank K. Soong, Lei Xie

    Abstract: End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to sty…

    Submitted 18 June, 2021; originally announced June 2021.

  41. arXiv:2106.04312  [pdf, other]

    eess.AS cs.SD

    Speech BERT Embedding For Improving Prosody in Neural TTS

    Authors: Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

    Abstract: This paper presents a speech BERT model to extract embedded prosody information in speech segments for improving the prosody of synthesized speech in neural text-to-speech (TTS). As a pre-trained model, it can learn prosody attributes from a large amount of speech data, which can utilize more data than the original training data used by the target TTS. The embedding is extracted from the previous…

    Submitted 14 September, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Journal ref: ICASSP 2021

  42. arXiv:2104.01818  [pdf, other]

    eess.AS

    The Multi-speaker Multi-style Voice Cloning Challenge 2021

    Authors: Qicong Xie, Xiaohai Tian, Guanghou Liu, Kun Song, Lei Xie, Zhiyong Wu, Hai Li, Song Shi, Haizhou Li, Fen Hong, Hui Bu, Xin Xu

    Abstract: The Multi-speaker Multi-style Voice Cloning Challenge (M2VoC) aims to provide a common sizable dataset as well as a fair testbed for the benchmarking of the popular voice cloning task. Specifically, we formulate the challenge to adapt an average TTS model to the stylistic target voice with limited data from target speaker, evaluated by speaker identity and style similarity. The challenge consists…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: has been accepted to ICASSP 2021

  43. arXiv:2103.08765  [pdf, other]

    eess.SP cs.IT

    Data Discovery Using Lossless Compression-Based Sparse Representation

    Authors: Elyas Sabeti, Peter X. K. Song, Alfred O. Hero III

    Abstract: Sparse representation has been widely used in data compression, signal and image denoising, dimensionality reduction and computer vision. While overcomplete dictionaries are required for sparse representation of multidimensional data, orthogonal bases represent one-dimensional data well. In this paper, we propose a data-driven sparse representation using orthonormal bases under the lossless compre…

    Submitted 16 March, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

  44. arXiv:2103.03541  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

    Authors: Mutian He, Jingzhou Yang, Lei He, Frank K. Soong

    Abstract: To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates capabilities to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40s transcribed recording, without th…

    Submitted 9 July, 2021; v1 submitted 5 March, 2021; originally announced March 2021.

    Comments: 17 pages

  45. arXiv:2012.05168  [pdf, other]

    cs.SD cs.CL eess.AS

    SongMASS: Automatic Song Writing with Pre-training and Alignment Constraint

    Authors: Zhonghao Sheng, Kaitao Song, Xu Tan, Yi Ren, Wei Ye, Shikun Zhang, Tao Qin

    Abstract: Automatic song writing aims to compose a song (lyric and/or melody) by machine, which is an interesting topic in both academia and industry. In automatic song writing, lyric-to-melody generation and melody-to-lyric generation are two important tasks, both of which usually suffer from the following challenges: 1) the paired lyric and melody data are limited, which affects the generation quality of…

    Submitted 9 December, 2020; originally announced December 2020.

  46. arXiv:2011.08480  [pdf, other]

    eess.AS cs.SD

    s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis

    Authors: Xi Wang, Huaiping Ming, Lei He, Frank K. Soong

    Abstract: Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention one, e.g. Transformer, to characterize a speech utterance, has achieved significant improvement of speech synthesis. However, it is still very challenging to deal with different sentence lengths, particularly, for long sentences where sequence model has limitation of the effective context…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 5 pages, 5 figures

  47. Adaptive multi-channel event segmentation and feature extraction for monitoring health outcomes

    Authors: Xichen She, Yaya Zhai, Ricardo Henao, Christopher W. Woods, Christopher Chiu, Geoffrey S. Ginsburg, Peter X. K. Song, Alfred O. Hero

    Abstract: $\textbf{Objective}$: To develop a multi-channel device event segmentation and feature extraction algorithm that is robust to changes in data distribution. $\textbf{Methods}…

    Submitted 19 November, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

    Journal ref: IEEE Transactions on Biomedical Engineering, Nov. 17 2020

  48. arXiv:2008.04658  [pdf, other]

    eess.AS cs.SD

    Transfer Learning for Improving Singing-voice Detection in Polyphonic Instrumental Music

    Authors: Yuanbo Hou, Frank K. Soong, Jian Luan, Shengchen Li

    Abstract: Detecting singing-voice in polyphonic instrumental music is critical to music information retrieval. To train a robust vocal detector, a large dataset marked with vocal or non-vocal label at frame-level is essential. However, frame-level labeling is time-consuming and labor expensive, with the result that few well-labeled datasets are available for singing-voice detection (S-VD). Hence, we propose a d…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Accepted by INTERSPEECH 2020

  49. arXiv:2006.01385  [pdf]

    eess.IV cs.CV

    Adaptive convolutional neural networks for k-space data interpolation in fast magnetic resonance imaging

    Authors: Tianming Du, Honggang Zhang, Yuemeng Li, Hee Kwon Song, Yong Fan

    Abstract: Deep learning in k-space has demonstrated great potential for image reconstruction from undersampled k-space data in fast magnetic resonance imaging (MRI). However, existing deep learning-based image reconstruction methods typically apply weight-sharing convolutional neural networks (CNNs) to k-space data without taking into consideration the k-space data's spatial frequency properties, leading to…

    Submitted 9 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

  50. arXiv:2005.10438  [pdf, other]

    cs.SD eess.AS

    Conversational End-to-End TTS for Voice Agent

    Authors: Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

    Abstract: End-to-end neural TTS has achieved superior performance on reading style speech synthesis. However, it's still a challenge to build a high-quality conversational TTS due to the limitations of the corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under sequence to sequence modeling framework. We firstly construct a spontaneous conversational speech c…

    Submitted 16 November, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Accepted by SLT 2021; 7 pages