-
Implementation of Firm-Dispatchable Generation in South Africa
Authors:
Stephen R. Clark,
Craig McGregor
Abstract:
South Africa is currently facing a critical situation in its power generation landscape, which is plagued by frequent power outages and the need to move from fossil fuels to renewable energy sources. This period emphasizes the importance of having firm-dispatchable power to balance out the intermittent nature of wind and solar energy sources. The paper proposes to repurpose old coal-fired power plants to generate firm-dispatchable energy in line with the principles of a Just Transition. Eskom's coal plants are approaching the end of their economic life, and their declining energy availability factor is becoming a challenge in meeting the country's energy needs. The study suggests that a comprehensive strategy that integrates wind, solar, and firm-dispatchable power can be cost-effective and reliable compared to the traditional coal-based approach or the nuclear alternative. The study emphasizes the necessity of a 25-year plan that would invest in flexible and modular dispatchable generation. It also highlights the strategic location of this generating capacity, including repurposing decommissioned coal plant sites. The proposed model integrates private investment, adheres to established best practices, and emphasizes adaptability to changing demand dynamics. The study provides a roadmap for enabling firm-dispatchable capacity for South Africa's energy transition, emphasizing economic prudence, environmental sustainability, and alignment with the principles of the Just Transition program.
Submitted 22 March, 2024;
originally announced March 2024.
-
Firm-Dispatchable Power and its Requirement in a Power System based on Variable Generation
Authors:
Stephen R. Clark,
Craig McGregor
Abstract:
Many countries have commenced a transition from fossil fuel-based electricity generation systems to sustainable systems based on wind and solar generation. It is often noted that the least cost approach would involve a massive scale-up in the building of variable renewables, supported by battery storage and gas peaking plants. The required backup should be firm-dispatchable generation rather than peaking power. The wind and solar generation aspects for this system are clearly defined and understood; however, the term firm-dispatchable power is not defined and the specific requirements are poorly understood. This study seeks to define firm-dispatchable power in this context and its requirement in the sustainable generation system. The study compares 100% renewable generation scenarios from South Africa, Texas, and the UK to demonstrate the requirement for this firm-dispatchable generation. The results indicate that firm-dispatchable generation must be available to replace the renewable generation completely. The required installed capacity for this firm-dispatchable generation does not vary with the distinct demand profiles of the different locations or their comparative renewable generation profiles. It also does not change significantly with the use of energy storage. The usage for this firm-dispatchable generation will vary due to the comparative economics of its use, but the requirement for its installation does not change.
Submitted 16 March, 2024;
originally announced March 2024.
-
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Authors:
Lev Finkelstein,
Heiga Zen,
Norman Casagrande,
Chun-an Chan,
Ye Jia,
Tom Kenter,
Alexey Petelin,
Jonathan Shen,
Vincent Wan,
Yu Zhang,
Yonghui Wu,
Rob Clark
Abstract:
Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not feature these aspects originally - remain challenging. One of the challenges is that models that have high-quality transfer capabilities can have issues in stability, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robust TTS system designed for a high-quality transfer task; in particular, a CHiVE-BERT monolingual TTS system is trained on the output of a Tacotron model designed for accent transfer. While some quality loss is inevitable with this approach, experimental results show that the models trained on synthetic data this way can produce high-quality audio displaying accent transfer, while preserving speaker characteristics such as speaking style.
Submitted 28 August, 2022;
originally announced August 2022.
-
ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech
Authors:
Xin Wang,
Junichi Yamagishi,
Massimiliano Todisco,
Hector Delgado,
Andreas Nautsch,
Nicholas Evans,
Md Sahidullah,
Ville Vestman,
Tomi Kinnunen,
Kong Aik Lee,
Lauri Juvela,
Paavo Alku,
Yu-Huai Peng,
Hsin-Te Hwang,
Yu Tsao,
Hsin-Min Wang,
Sebastien Le Maguer,
Markus Becker,
Fergus Henderson,
Rob Clark,
Yu Zhang,
Quan Wang,
Ye Jia,
Kai Onuma,
Koji Mushika, et al. (15 additional authors not shown)
Abstract:
Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects.
Submitted 14 July, 2020; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs
Authors:
Rob Clark,
Hanna Silen,
Tom Kenter,
Ralph Leith
Abstract:
Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate, as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of speech, and presenting a selection of speech or text as context and evaluating the subsequent speech. We find that, even though these three evaluations are based upon the same material, the outcomes differ per setting, and moreover that these outcomes do not necessarily correlate with each other. We show that our findings are consistent between a single speaker setting of read paragraphs and a two-speaker dialogue scenario. We conclude that to evaluate the quality of long-form speech, the traditional way of evaluating sentences in isolation does not suffice, and that multiple evaluations are required.
Submitted 9 September, 2019;
originally announced September 2019.
-
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
Authors:
Vincent Wan,
Chun-an Chan,
Tom Kenter,
Jakub Vit,
Rob Clark
Abstract:
The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the auto-encoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.
Submitted 4 June, 2019; v1 submitted 17 May, 2019;
originally announced May 2019.
-
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Authors:
Heiga Zen,
Viet Dang,
Rob Clark,
Yu Zhang,
Ron J. Weiss,
Ye Jia,
Zhifeng Chen,
Yonghui Wu
Abstract:
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.
Submitted 5 April, 2019;
originally announced April 2019.
-
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
Authors:
RJ Skerry-Ryan,
Eric Battenberg,
Ying Xiao,
Yuxuan Wang,
Daisy Stanton,
Joel Shor,
Ron J. Weiss,
Rob Clark,
Rif A. Saurous
Abstract:
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
Submitted 23 March, 2018;
originally announced March 2018.