Recovering implicit pitch contours from formants in whispered speech

Authors: Pablo Pérez Zarazaga, Zofia Malisz

Abstract: Whispered speech is characterised by a noise-like excitation that results in the lack of fundamental frequency. Considering that prosodic phenomena such as intonation are perceived through f0 variation, the perception of whispered prosody is relatively difficult. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is being transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate the way in which formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.

Submitted 6 July, 2023; originally announced July 2023.

Comments: 5 pages, 3 figures, 2 tables, Accepted at ICPhS 2023

arXiv:2306.01957 [pdf, other]

Speaker-independent neural formant synthesis

Authors: Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

Abstract: We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spectrograms, which are rendered into waveforms by a pre-trained neural vocoder. Experiments with WaveNet and HiFi-GAN confirm that the method achieves our goals of accurate control over speech parameters combined with high perceptual audio quality. We also find that the small set of phonetically relevant speech parameters we use is sufficient to allow for speaker-independent synthesis (a.k.a. universal vocoding).

Submitted 2 June, 2023; originally announced June 2023.

Comments: 5 pages, 4 figures. Article accepted at INTERSPEECH 2023

arXiv:2303.07442 [pdf, other]

A processing framework to access large quantities of whispered speech found in ASMR

Authors: Pablo Perez Zarazaga, Gustav Eje Henter, Zofia Malisz

Abstract: Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio-annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allows to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.

Submitted 13 March, 2023; originally announced March 2023.

Comments: Accepted at ICASSP 2023, 5 pages, 2 figures, 2 tables

arXiv:2010.09488 [pdf, other]

Users Perceptions about Teleconferencing Applications Collected through Twitter

Authors: Abraham Woubie, Pablo Pérez Zarazaga, Tom Bäckström

Abstract: The COVID-19 outbreak disrupted different organizations, employees and students, who turned to teleconference applications to collaborate and socialize even during the quarantine. Thus, the demand of teleconferencing applications surged with mobile application downloads reaching the highest number ever seen. However, some of the teleconference applications recently suffered from several issues such as security, privacy, media quality, reliability, capacity and technical difficulties. Thus, in this work, we explore the opinions of different users towards different teleconference applications. Firstly, posts on Twitter, known as tweets, about remote working and different teleconference applications are extracted using different keywords. Then, the extracted tweets are passed to sentiment classifier to classify the tweets into positive and negative. Afterwards, the most important features of different teleconference applications are extracted and analyzed. Finally, we highlight the main strengths, drawbacks and challenges of different teleconference applications.

Submitted 16 October, 2020; originally announced October 2020.

Showing 1–4 of 4 results for author: Zarazaga, P P