
Showing 1–7 of 7 results for author: Yin, D

Searching in archive eess.
  1. arXiv:2304.05922  [pdf, other]

    eess.AS cs.SD

    Filler Word Detection with Hard Category Mining and Inter-Category Focal Loss

    Authors: Zhiyuan Zhao, Lijun Wu, Chuanxin Tang, Dacheng Yin, Yucheng Zhao, Chong Luo

    Abstract: Filler words like "um" or "uh" are common in spontaneous speech. It is desirable to automatically detect and remove them in recordings, as they affect the fluency, confidence, and professionalism of speech. Previous studies and our preliminary experiments reveal that the biggest challenge in filler word detection is that fillers can be easily confused with other hard categories like "a" or "I"…

    Submitted 12 April, 2023; originally announced April 2023.

    Comments: Accepted by ICASSP 2023

  2. arXiv:2210.12995  [pdf, other]

    eess.AS cs.SD

    TridentSE: Guiding Speech Enhancement with 32 Global Tokens

    Authors: Dacheng Yin, Zhiyuan Zhao, Chuanxin Tang, Zhiwei Xiong, Chong Luo

    Abstract: In this paper, we present TridentSE, a novel architecture for speech enhancement, which is capable of efficiently capturing both global information and local details. TridentSE maintains T-F bin level representation to capture details, and uses a small number of global tokens to process the global information. Information is propagated between the local and the global representations through cross…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: 5 pages, 2 figures, 3 tables

  3. arXiv:2206.13865  [pdf, other]

    eess.AS cs.SD

    RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

    Authors: Dacheng Yin, Chuanxin Tang, Yanqing Liu, Xiaoqiang Wang, Zhiyuan Zhao, Yucheng Zhao, Zhiwei Xiong, Sheng Zhao, Chong Luo

    Abstract: This paper proposes a new "decompose-and-edit" paradigm for the text-based speech insertion task that facilitates arbitrary-length speech insertion and even full sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we proposed to represent the g…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: 5 pages, 1 figure, 3 tables. Accepted by Interspeech 2022

  4. arXiv:2202.12307  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

    Authors: Dacheng Yin, Xuanchi Ren, Chong Luo, Yuwang Wang, Zhiwei Xiong, Wenjun Zeng

    Abstract: This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, define…

    Submitted 24 February, 2022; originally announced February 2022.

    Comments: Accepted to ICLR 2022. Project page at https://ydcustc.github.io/retriever-demo/

  5. arXiv:2109.05426  [pdf, other]

    cs.SD cs.AI eess.AS

    Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

    Authors: Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng

    Abstract: Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this fr…

    Submitted 12 September, 2021; originally announced September 2021.

    Comments: Published in Interspeech'21

  6. arXiv:2102.01930  [pdf, other]

    cs.SD cs.LG eess.AS

    General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework

    Authors: Yucheng Zhao, Dacheng Yin, Chong Luo, Zhiyuan Zhao, Chuanxin Tang, Wenjun Zeng, Zheng-Jun Zha

    Abstract: This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. In the design of MGF, speech hierarchy is taken into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic informatio…

    Submitted 3 February, 2021; originally announced February 2021.

  7. arXiv:1911.04697  [pdf, other]

    cs.SD eess.AS

    PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network

    Authors: Dacheng Yin, Chong Luo, Zhiwei Xiong, Wenjun Zeng

    Abstract: Time-frequency (T-F) domain masking is a mainstream approach for single-channel speech enhancement. Recently, attention has turned to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods that directly use a complex ideal ratio mask to supervise the DNN learning, w…

    Submitted 12 November, 2019; originally announced November 2019.

    Comments: Accepted by AAAI'20