-
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Authors:
Jaeyeon Kim,
Jaeyoon Jung,
**joo Lee,
Sang Hoon Woo
Abstract:
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap .
Submitted 31 January, 2024;
originally announced January 2024.
-
SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
Authors:
Hyunjae Cho,
Wonbin Jung,
Junhyeok Lee,
Sang Hoon Woo
Abstract:
In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because a multilingual corpus for a given speaker is difficult to obtain, training a multilingual TTS model with monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, as well as domain adversarial training, which is applied in other multilingual TTS models. Furthermore, with the speaker regularization loss added, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. SANE-TTS also maintains speaker similarity close to that of the ground truth even in cross-lingual inference. Audio samples are available on our web page.
Submitted 24 June, 2022;
originally announced June 2022.
-
Talking Face Generation with Multilingual TTS
Authors:
Hyoung-Kyu Song,
Sang Hoon Woo,
Junhyeok Lee,
Seungmin Yang,
Hyunjae Cho,
Youseong Lee,
Dongho Choi,
Kang-wook Kim
Abstract:
In this work, we propose a joint system combining a talking face generation system with a text-to-speech system that can generate multilingual talking face videos from text input alone. Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker, as well as lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system by selecting four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model to outputs of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber so that users can utilize the multilingual property of our system more easily.
Submitted 12 May, 2022;
originally announced May 2022.
-
Nanoscale Topographical Replication of Graphene Architecture by Artificial DNA nanostructures
Authors:
Y. Moon,
J. Shin,
S. Seo,
J. Park,
S. R. Dugasani,
S. H. Woo,
T. Park,
S. H. Park,
J. R. Ahn
Abstract:
Despite many studies on how geometry can be used to control the electronic properties of graphene, certain limitations to fabrication of designed graphene nanostructures exist. Here, we demonstrate controlled topographical replication of graphene by artificial deoxyribonucleic acid (DNA) nanostructures. Owing to the high degree of geometrical freedom of DNA nanostructures, we controlled the nanoscale topography of graphene. The topography of graphene replicated from DNA nanostructures showed enhanced thermal stability and revealed an interesting negative temperature coefficient of sheet resistivity when underlying DNA nanostructures were denatured at high temperatures.
Submitted 12 June, 2014;
originally announced June 2014.