Search | arXiv e-print repository

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: v1.1 (fixed typos)

arXiv:2307.07062 [pdf, other]

Controllable Emphasis with zero data for text-to-speech

Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques improving naturalness by $7.3\%$ and correct testers' identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.

Submitted 13 July, 2023; originally announced July 2023.

Comments: In proceeding of 12th Speech Synthesis Workshop (SSW) 2023

arXiv:2303.16085 [pdf, other]

Whole-body PET image denoising for reduced acquisition time

Authors: Ivan Kruzhilov, Stepan Kudin, Luka Vetoshkin, Elena Sokolova, Vladimir Kokh

Abstract: This paper evaluates the performance of supervised and unsupervised deep learning models for denoising positron emission tomography (PET) images in the presence of reduced acquisition times. Our experiments consider 212 studies (56908 images), and evaluate the models using 2D (RMSE, SSIM) and 3D (SUVpeak and SUVmax error for the regions of interest) metrics. It was shown that, in contrast to previous studies, supervised models (ResNet, Unet, SwinIR) outperform unsupervised models (pix2pix GAN and CycleGAN with ResNet backbone and various auxiliary losses) in the reconstruction of 2D PET images. Moreover, a hybrid approach of supervised CycleGAN shows the best results in SUVmax estimation for denoised images, and the SUVmax estimation error for denoised images is comparable with the PET reproducibility error.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2202.06409 [pdf, other]

Distribution augmentation for low-resource expressive text-to-speech

Authors: Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

Abstract: This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves robustness of attention-based TTS models.

Submitted 19 February, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: ICASSP 2022: camera-ready

arXiv:2105.11863 [pdf, other]

CoRSAI: A System for Robust Interpretation of CT Scans of COVID-19 Patients Using Deep Learning

Authors: Manvel Avetisian, Ilya Burenko, Konstantin Egorov, Vladimir Kokh, Aleksandr Nesterov, Aleksandr Nikolaev, Alexander Ponomarchuk, Elena Sokolova, Alex Tuzhilin, Dmitry Umerenkov

Abstract: Analysis of chest CT scans can be used in detecting parts of lungs that are affected by infectious diseases such as COVID-19. Determining the volume of lungs affected by lesions is essential for formulating treatment recommendations and prioritizing patients by severity of the disease. In this paper we adopted an approach based on using an ensemble of deep convolutional neural networks for segmentation of slices of lung CT scans. Using our models we are able to segment the lesions, evaluate patients dynamics, estimate relative volume of lungs affected by lesions and evaluate the lung damage stage. Our models were trained on data from different medical centers. We compared predictions of our models with those of six experienced radiologists and our segmentation model outperformed most of them. On the task of classification of disease severity, our model outperformed all the radiologists.

Submitted 25 May, 2021; originally announced May 2021.

arXiv:2011.09303 [pdf, other]

Noise-Resilient Automatic Interpretation of Holter ECG Recordings

Authors: Konstantin Egorov, Elena Sokolova, Manvel Avetisian, Alexander Tuzhilin

Abstract: Holter monitoring, a long-term ECG recording (24-hours and more), contains a large amount of valuable diagnostic information about the patient. Its interpretation becomes a difficult and time-consuming task for the doctor who analyzes them because every heartbeat needs to be classified, thus requiring highly accurate methods for automatic interpretation. In this paper, we present a three-stage process for analysing Holter recordings with robustness to noisy signal. First stage is a segmentation neural network (NN) with encoder-decoder architecture which detects positions of heartbeats. Second stage is a classification NN which will classify heartbeats as wide or narrow. Third stage is gradient boosting decision trees (GBDT) on top of NN features that incorporates patient-wise features and further increases performance of our approach. As a part of this work we acquired 5095 Holter recordings of patients annotated by an experienced cardiologist. A committee of three cardiologists served as ground truth annotators for the 291 examples in the test set. We show that the proposed method outperforms the selected baselines, including two commercial-grade software packages and some methods previously published in the literature.

Submitted 17 November, 2020; originally announced November 2020.

Comments: Accepted for publication on BIOSIGNALS 2021

arXiv:2006.15956 [pdf, other]

doi 10.1051/0004-6361/202038822

Search for glitches of gamma-ray pulsars with deep learning

Authors: E. V. Sokolova, A. G. Panin

Abstract: The pulsar glitches are generally assumed to be an apparent manifestation of the superfluid interior of the neutron stars. Most of them were discovered and extensively studied by continuous monitoring in the radio wavelengths. The Fermi-LAT space telescope has made a revolution uncovering a large population of gamma-ray pulsars. In this paper we suggest to employ these observations for the searches of new glitches. We develop the method capable of detecting step-like frequency change associated with glitches in a sparse gamma-ray data. It is based on the calculations of the weighted H-test statistics and glitch identification by a convolutional neural network. The method demonstrates high accuracy on the Monte Carlo set and will be applied for searches of the pulsar glitches in the real gamma-ray data in the future works.

Submitted 26 May, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: 5 pages, 5 figures

Report number: INR-TH-2020-020

Journal ref: A&A 660, A43 (2022)

arXiv:1601.00330 [pdf, other]

doi 10.3847/1538-4357/833/2/271

Search for differences between radio-loud and radio-quiet gamma-ray pulsar populations with Fermi-LAT data

Authors: E. V. Sokolova, G. I. Rubtsov

Abstract: Observations by Fermi LAT enabled us to explore the population of non-recycled gamma-ray pulsars with the set of 89 objects. It was recently noted that there are apparent differences in properties of radio-quiet and radio-loud subsets. In particular, average observed radio-loud pulsar is younger than radio-quiet one and is located at smaller galactic latitude. Even so, the analysis based on the full list of pulsars may suffer from selection effects. Namely, most of radio-loud pulsars are first discovered in the radio-band, while radio-quiet ones are found using the gamma-ray data. In this work we perform a blind search for gamma-ray pulsars using the Fermi LAT data alone using all point sources from 3FGL catalog as the candidates. Unlike preceding blind search, the present catalog is constructed with novel semi-coherent method and covers the full range of characteristic ages down to 1 kyr. The search resulted in the catalog of 40 non-recycled pulsars, 26 of which are radio-quiet. There are no statistically significant differences in age and galactic latitude distributions for the radio-loud and radio-quiet pulsars, while the rotation period distributions are marginally different with $2.4\sigma$ pre-trial statistical significance. The fraction of radio-quiet pulsars is estimated as $\varepsilon_{RQ}=63\pm 8\%$. The results are in agreement with the predictions of the outer magnetosphere models, while the Polar cap models are disfavored.

Submitted 14 November, 2016; v1 submitted 3 January, 2016; originally announced January 2016.

Comments: 6 pages, 7 figures

Report number: INR-TH/2016-001

Journal ref: Astrophys.J. 833 (2016) no.2, 271

arXiv:1406.0608 [pdf, other]

doi 10.7868/S0370274X14230015

Blind search for radio-quiet and radio-loud gamma-ray pulsars with Fermi-LAT data

Authors: G. I. Rubtsov, E. V. Sokolova

Abstract: The Fermi Large Area Telescope (LAT) has observed more than a hundred of gamma-ray pulsars, about one third of which are radio-quiet, i.e. not detected at radio frequencies. The most of radio-loud pulsars are detected by Fermi LAT by using the radio timing models, while the radio-quiet ones are discovered in a blind search. The difference in the techniques introduces an observational selection bias and, consequently, the direct comparison of populations is complicated. In order to produce an unbiased sample, we perform a blind search of gamma-ray pulsations using Fermi-LAT data alone. No radio data or observations at optical or X-ray frequencies are involved in the search process. We produce a gamma-ray selected catalog of 25 non-recycled gamma-ray pulsars found in a blind search, including 16 radio-quiet and 9 radio-loud pulsars. This results in the direct measurement of the fraction of radio-quiet pulsars $\varepsilon_{RQ} = 64\pm 10\%$, which is in agreement with the existing estimates from the population modeling in the outer magnetosphere model. The Polar cap models are disfavored due to a lower expected fraction and the prediction of age dependence. The age, gamma-ray energy flux, spin-down luminosity and sky location distributions of the radio-loud and radio-quiet pulsars from the catalog do not demonstrate any statistically significant difference. The results indicate that the radio-quiet and radio-loud pulsars belong to one and the same population. The catalog shows no evidence for the radio beam evolution.

Submitted 30 October, 2014; v1 submitted 3 June, 2014; originally announced June 2014.

Comments: 5 pages, 3 figures; accepted for publication in JETP Letters

Report number: INR-TH/2014-013

Journal ref: Pis'ma v ZhETF 100:787-792, 2014

arXiv:0901.0191 [pdf, ps, other]

Field theoretical representation of classical statistical mechanics. I. Wave-vector space

Authors: A. Yu. Zakharov, E. V. Sokolova

Abstract: Thermodynamic equivalence between classical many-body system and some auxiliary nonlinear auxiliary field is proved. Connection between Hamiltonians of the many-body system and the auxiliary field is derived.

Submitted 1 January, 2009; originally announced January 2009.

Comments: 7 pages. PACS 05.20.-y, 05.70.Ce, 03.50.Kk

Showing 1–10 of 10 results for author: Sokolova, E