Showing 1–20 of 20 results for author: Eskimez, S

  1. arXiv:2406.18009  [pdf, other]

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.05699  [pdf, ps, other]

    eess.AS cs.AI eess.SP

    An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

    Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audi…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH2024

  3. arXiv:2406.04281  [pdf, other]

    eess.AS

    Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda

    Abstract: Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations a…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2402.07383  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

    Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

    Abstract: Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing an…

    Submitted 4 March, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: See https://aka.ms/elate/ for demo samples, v2: subjective evaluation has been added

  5. arXiv:2308.06873  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

    Abstract: Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile…

    Submitted 25 June, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

    Comments: To appear in TASLP. See https://aka.ms/speechx for demo samples

  6. arXiv:2303.07005  [pdf, other]

    eess.AS

    Real-Time Audio-Visual End-to-End Speech Enhancement

    Authors: Zirun Zhu, Hemin Yang, Min Tang, Ziyi Yang, Sefik Emre Eskimez, Huaming Wang

    Abstract: Audio-visual speech enhancement (AV-SE) methods utilize auxiliary visual cues to enhance speakers' voices. Therefore, technically they should be able to outperform the audio-only speech enhancement (SE) methods. However, there are few works in the literature on an AV-SE system that can work in real time on a CPU. In this paper, we propose a low-latency real-time audio-visual end-to-end enhancement…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  7. arXiv:2211.05172  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech separation with large-scale self-supervised learning

    Authors: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

    Abstract: Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained mo…

    Submitted 25 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

  8. arXiv:2211.02944  [pdf, other]

    eess.AS cs.SD

    Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. The PSE models also tend to leak interfering speech when the target speaker is…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  9. arXiv:2211.02773  [pdf, other]

    eess.AS cs.SD

    Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang

    Abstract: Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices. To deploy a PSE model for full duplex communications, the model must be combined with acoustic echo cancellation (AEC), although such a combination has been less explored. This paper proposes a series of methods that ar…

    Submitted 25 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  10. arXiv:2204.03232  [pdf, other]

    eess.AS cs.AI eess.SP

    Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

    Authors: Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Existing multi-channel continuous speech separation (CSS) models are heavily dependent on supervised data - either simulated data which causes data mismatch between the training and real-data testing, or the real transcribed overlapping data, which is difficult to be acquired, hindering further improvements in the conversational/meeting transcription tasks. In this paper, we propose a three-stage…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  11. arXiv:2204.00771  [pdf, other]

    eess.AS cs.SD eess.SP

    Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

    Authors: Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang

    Abstract: This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. Besides, we use KD techniques to develop compresse…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech conference 2022 https://interspeech2022.org/

  12. arXiv:2202.13288  [pdf, other]

    eess.AS cs.SD

    ICASSP 2022 Deep Noise Suppression Challenge

    Authors: Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner

    Abstract: The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. This is the 4th DNS challenge, with the previous editions held at INTERSPEECH 2020, ICASSP 2021, and INTERSPEECH 2021. We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective e…

    Submitted 26 February, 2022; originally announced February 2022.

  13. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, page 3775-3779

  14. arXiv:2110.14142  [pdf, other]

    eess.AS cs.SD

    Separating Long-Form Speech with Group-Wise Permutation Invariant Training

    Authors: Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

    Abstract: Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it l…

    Submitted 17 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022

  15. arXiv:2110.10330  [pdf, other]

    eess.AS cs.SD

    One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Zhuo Chen, Xuedong Huang

    Abstract: With the recent surge of video conferencing tools usage, providing high-quality speech signals and accurate captions have become essential to conduct day-to-day business or connect with friends and families. Single-channel personalized speech enhancement (PSE) methods show promising results compared with the unconditional speech enhancement (SE) methods in these scenarios due to their ability to r…

    Submitted 19 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  16. arXiv:2110.09625  [pdf, other]

    eess.AS cs.LG cs.SD

    Personalized Speech Enhancement: New Models and Comprehensive Evaluation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang

    Abstract: Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed Voice…

    Submitted 18 October, 2021; originally announced October 2021.

  17. arXiv:2110.06428  [pdf, other]

    eess.AS cs.SD

    All-neural beamformer for continuous speech separation

    Authors: Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

    Abstract: Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followe…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 2 tables

  18. arXiv:2106.02896  [pdf, other]

    eess.AS

    Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

    Abstract: With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework t…

    Submitted 5 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH2021

  19. arXiv:2102.11114  [pdf, other]

    cs.CL cs.SD eess.AS

    Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model

    Authors: Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, Michael Zeng

    Abstract: Modern Automatic Speech Recognition (ASR) systems can achieve high performance in terms of recognition accuracy. However, a perfectly accurate transcript still can be challenging to read due to disfluency, filter words, and other errata common in spoken communication. Many downstream tasks and human readers rely on the output of the ASR system; therefore, errors introduced by the speaker and ASR s…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: Accepted in 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2021)

  20. arXiv:2008.03592  [pdf, other]

    eess.AS cs.CV cs.LG cs.MM

    Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

    Authors: Sefik Emre Eskimez, You Zhang, Zhiyao Duan

    Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face vid…

    Submitted 21 July, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: Accepted to IEEE Transactions on Multimedia