-
Generacion de voces artificiales infantiles en castellano con acento costarricense
Authors:
Ana Lilia Alvarez-Blanco,
Eugenia Cordoba-Warner,
Marvin Coto-Jimenez,
Vivian Fallas-Lopez,
Maribel Morales Rodriguez
Abstract:
This article evaluates a first experience of generating artificial children's voices with a Costa Rican accent, using the technique of statistical parametric speech synthesis based on Hidden Markov Models. The process of recording the voice samples used for learning the models, the fundamentals of the technique used and the subjective evaluation of the results through the perception of a group of…
▽ More
This article evaluates a first experience of generating artificial children's voices with a Costa Rican accent, using the technique of statistical parametric speech synthesis based on Hidden Markov Models. The process of recording the voice samples used for learning the models, the fundamentals of the technique used and the subjective evaluation of the results through the perception of a group of people is described. The results show that the intelligibility of the results, evaluated in isolated words, is lower than the voices recorded by the group of participating children. Similarly, the detection of the age and gender of the speaking person is significantly affected in artificial voices, relative to recordings of natural voices. These results show the need to obtain larger amounts of data, in addition to becoming a numerical reference for future developments resulting from new data or from processes to improve results in the same technique.
△ Less
Submitted 1 February, 2021;
originally announced February 2021.
-
Supervised Initialization of LSTM Networks for Fundamental Frequency Detection in Noisy Speech Signals
Authors:
Marvin Coto-Jimenez
Abstract:
Fundamental frequency is one of the most important parameters of human speech, of importance for the classification of accent, gender, speaking styles, speaker identification, age, among others. The proper detection of this parameter remains as an important challenge for severely degraded signals. In previous references for detecting fundamental frequency in noisy speech using deep learning, the n…
▽ More
Fundamental frequency is one of the most important parameters of human speech, of importance for the classification of accent, gender, speaking styles, speaker identification, age, among others. The proper detection of this parameter remains as an important challenge for severely degraded signals. In previous references for detecting fundamental frequency in noisy speech using deep learning, the networks, such as Long Short-term Memory (LSTM) has been initialized with random weights, and then trained following a back-propagation through time algorithm. In this work, a proposal for a more efficient initialization, based on a supervised training using an Auto-associative network, is presented. This initialization is a better starting point for the detection of fundamental frequency in noisy speech. The advantages of this initialization are noticeable using objective measures for the accuracy of the detection and for the training of the networks, under the presence of additive white noise at different signal-to-noise levels.
△ Less
Submitted 11 November, 2019;
originally announced November 2019.
-
LSTM Deep Neural Networks Postfiltering for Improving the Quality of Synthetic Voices
Authors:
Marvin Coto-Jiménez,
John Goddard-Close
Abstract:
Recent developments in speech synthesis have produced systems capable of outcome intelligible speech, but now researchers strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents.
HMM-based Speech Synthesis is of great interest to many researchers, due to its ability to produce sophis…
▽ More
Recent developments in speech synthesis have produced systems capable of outcome intelligible speech, but now researchers strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents.
HMM-based Speech Synthesis is of great interest to many researchers, due to its ability to produce sophisticated features with small footprint. Despite such progress, its quality has not yet reached the level of the predominant unit-selection approaches that choose and concatenate recordings of real speech. Recent efforts have been made in the direction of improving these systems.
In this paper we present the application of Long-Short Term Memory Deep Neural Networks as a Postfiltering step of HMM-based speech synthesis, in order to obtain closer spectral characteristics to those of natural speech. The results show how HMM-voices could be improved using this approach.
△ Less
Submitted 8 February, 2016;
originally announced February 2016.