Showing 1–6 of 6 results for author: Beliaev, S

  1. arXiv:2110.03584  [pdf, other]

    eess.AS

    Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings

    Authors: Oktai Tatanov, Stanislav Beliaev, Boris Ginsburg

    Abstract: This paper describes Mixer-TTS, a non-autoregressive model for mel-spectrogram generation. The model is based on the MLP-Mixer architecture adapted for speech synthesis. The basic Mixer-TTS contains pitch and duration predictors, with the latter being trained with an unsupervised TTS alignment framework. Alongside the basic model, we propose the extended version which additionally uses token embed…

    Submitted 22 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Preprint. Submitted to ICASSP-22

  2. arXiv:2104.08189  [pdf, other]

    eess.AS cs.AI

    TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

    Authors: Stanislav Beliaev, Boris Ginsburg

    Abstract: We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model consists of three feed-forward convolutional networks. The first network predicts grapheme durations. An input text is expanded by repeating each symbol according to the predicted duration. The second network predicts pitch value for every mel frame. The t…

    Submitted 17 June, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2005.05514

  3. arXiv:2005.07815  [pdf, other]

    eess.AS cs.SD

    ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

    Authors: Yurii Rebryk, Stanislav Beliaev

    Abstract: We propose a neural network for zero-shot voice conversion (VC) without any parallel or transcribed data. Our approach uses pre-trained models for automatic speech recognition (ASR) and speaker embedding, obtained from a speaker verification task. Our model is fully convolutional and non-autoregressive except for a small pre-trained recurrent neural network for speaker encoding. ConVoice can conve…

    Submitted 15 May, 2020; originally announced May 2020.

  4. arXiv:2005.05514  [pdf, other]

    eess.AS

    TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model

    Authors: Stanislav Beliaev, Yurii Rebryk, Boris Ginsburg

    Abstract: We propose TalkNet, a convolutional non-autoregressive neural model for speech synthesis. The model consists of two feed-forward convolutional networks. The first network predicts grapheme durations. An input text is expanded by repeating each symbol according to the predicted duration. The second network generates a mel-spectrogram from the expanded text. To train a grapheme duration predictor, w…

    Submitted 11 May, 2020; originally announced May 2020.
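
The expansion step named in the TalkNet abstract above (repeating each input symbol according to its predicted duration) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper name `expand_by_duration` and the example durations are hypothetical, and the durations are assumed to be non-negative integer frame counts.

```python
from typing import List, Sequence


def expand_by_duration(symbols: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat symbols[i] exactly durations[i] times (hypothetical helper)."""
    if len(symbols) != len(durations):
        raise ValueError("symbols and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    expanded: List[str] = []
    for symbol, duration in zip(symbols, durations):
        # A duration of 0 drops the symbol from the expanded sequence.
        expanded.extend([symbol] * duration)
    return expanded


# Made-up durations (assumed to be counts of mel frames) for the graphemes of "cat".
frames = expand_by_duration(list("cat"), [2, 3, 1])
print(frames)  # ['c', 'c', 'a', 'a', 'a', 't']
```

The expanded sequence has one entry per output frame, so its length equals the sum of the durations.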

  5. arXiv:1910.10261  [pdf, other]

    eess.AS

    QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

    Authors: Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang

    Abstract: We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpe…

    Submitted 22 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  6. arXiv:1909.09577  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    NeMo: a toolkit for building AI applications using Neural Modules

    Authors: Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, Jonathan M. Cohen

    Abstract: NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations…

    Submitted 13 September, 2019; originally announced September 2019.

    Comments: 6 pages plus references