
Showing 1–41 of 41 results for author: Zen, H

  1. arXiv:2406.15401  [pdf, ps, other]

    physics.ins-det nucl-ex

    Circular polarization measurement for individual gamma rays in capture reactions with intense pulsed neutrons

    Authors: S. Endo, R. Abe, H. Fujioka, T. Ino, O. Iwamoto, N. Iwamoto, S. Kawamura, A. Kimura, M. Kitaguchi, R. Kobayashi, S. Nakamura, T. Oku, T. Okudaira, M. Okuizumi, M. Omer, G. Rovira, T. Shima, H. M. Shimizu, T. Shizuma, Y. Taira, S. Takada, S. Takahashi, H. Yoshikawa, T. Yoshioka, H. Zen

    Abstract: Measurements of circular polarization of $γ$-ray emitted from neutron capture reactions provide valuable information for nuclear physics studies. The spin and parity of excited states can be determined by measuring the circular polarization from polarized neutron capture reactions. Furthermore, the $γ$-ray circular polarization in a neutron capture resonance is crucial for studying the enhancement…

    Submitted 7 May, 2024; originally announced June 2024.

    Comments: 10 pages, 13 figures

  2. arXiv:2406.02133  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SimulTron: On-Device Simultaneous Speech to Speech Translation

    Authors: Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, Michelle Tadmor Ramanovich

    Abstract: Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the st…

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  4. arXiv:2402.18932  [pdf, other]

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024

  5. arXiv:2306.07580  [pdf, other]

    cs.RO

    SayTap: Language to Quadrupedal Locomotion

    Authors: Yujin Tang, Wenhao Yu, Jie Tan, Heiga Zen, Aleksandra Faust, Tatsuya Harada

    Abstract: Large language models (LLMs) have demonstrated the potential to perform high-level planning. Yet, it remains a challenge for LLMs to comprehend low-level commands, such as joint angle targets or motor torques. This paper proposes an approach to use foot contact patterns as an interface that bridges human commands in natural language and a locomotion controller that outputs these low-level commands…

    Submitted 14 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

  6. arXiv:2305.18802  [pdf, other]

    eess.AS cs.SD

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna

    Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  7. arXiv:2305.17547  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Translatotron 3: Speech to Speech Translation with Monolingual Data

    Authors: Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayuth Asawaroengchai, Heiga Zen, Michelle Tadmor Ramanovich

    Abstract: This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets by combining masked autoencoder, unsupervised embedding mapping, and back-translation. Experimental results in speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, reporting…

    Submitted 16 January, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: To appear in ICASSP 2024

  8. Quantum entanglement for continuous variables sharing in an expanding spacetime

    Authors: Wen-Mei Li, Rui-Di Wang, Hao-Yu Wu, Xiao-Li Huang, Hao-Sheng Zen, Shu-Min Wu

    Abstract: Detecting the structure of spacetime with quantum technologies has always been one of the frontier topics of relativistic quantum information. Here, we analytically study the generation and redistribution of Gaussian entanglement of the scalar fields in an expanding spacetime. We consider a two-mode squeezed state via a Gaussian amplification channel that corresponds to the time-evolution of the s…

    Submitted 17 March, 2023; originally announced March 2023.

    Comments: 17 pages, 4 figures

    Journal ref: Eur. Phys. J. C (2023) 83:222

  9. arXiv:2303.01664  [pdf, other]

    cs.SD cs.LG eess.AS

    Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

    Abstract: Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradation,…

    Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to WASPAA 2023

  10. arXiv:2210.15868  [pdf, other]

    cs.SD cs.CL eess.AS

    Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation

    Authors: Nobuyuki Morioka, Heiga Zen, Nanxin Chen, Yu Zhang, Yifan Ding

    Abstract: Adapting a neural text-to-speech (TTS) model to a target speaker typically involves fine-tuning most if not all of the parameters of a pretrained multi-speaker backbone model. However, serving hundreds of fine-tuned neural TTS models is expensive as each of them requires significant footprint and separate computational resources (e.g., accelerators, memory). To scale speaker adapted neural TTS voi…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  11. arXiv:2210.15447  [pdf, other]

    cs.SD cs.CL eess.AS

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired da…

    Submitted 15 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023

  12. arXiv:2210.01029  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

    Authors: Yuma Koizumi, Kohei Yatabe, Heiga Zen, Michiel Bacchiani

    Abstract: Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. The DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like it…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  13. arXiv:2208.13183  [pdf, other]

    cs.SD eess.AS

    Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

    Authors: Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, Rob Clark

    Abstract: Transfer tasks in text-to-speech (TTS) synthesis - where one or more aspects of the speech of one set of speakers is transferred to another set of speakers that do not feature these aspects originally - remains a challenging task. One of the challenges is that models that have high-quality transfer capabilities can have issues in stability, making them impractical for user-facing critical tasks. T…

    Submitted 28 August, 2022; originally announced August 2022.

    Comments: To be published in Interspeech 2022

  14. arXiv:2208.11091  [pdf]

    physics.acc-ph physics.app-ph physics.optics

    Full characterization of few-cycle superradiant pulses generated from a free-electron laser oscillator

    Authors: Heishun Zen, Ryoichi Hajima, Hideaki Ohgaki

    Abstract: The detailed structure of few-cycle superradiant pulses generated from a free-electron laser (FEL) oscillator was experimentally revealed for the first time. As predicted in numerical simulations, the obtained pulse structures have ringing, i.e., a train of subpulses following the main pulse. Owing to the phase retrieval with a combination of linear and nonlinear autocorrelation measurements, the…

    Submitted 17 August, 2022; originally announced August 2022.

    Comments: Main: 16 pages, 5 figures, Supplementary material: 10 pages, 10 figures, 1 video

    Journal ref: Sci.Rep. 13 (2023) 6350

  15. arXiv:2204.03409  [pdf, other]

    cs.CL cs.SD eess.AS

    MAESTRO: Matched Speech Text Representations through Modality Matching

    Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro Moreno, Ankur Bapna, Heiga Zen

    Abstract: We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task.…

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

    MSC Class: 68T10; ACM Class: I.2.7

  16. arXiv:2203.16749  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

    Authors: Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

    Abstract: Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality es…

    Submitted 4 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  17. arXiv:2201.03713  [pdf, other]

    cs.CL cs.SD eess.AS

    CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

    Authors: Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen

    Abstract: We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of t…

    Submitted 26 June, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

    Comments: LREC 2022

  18. arXiv:2106.09660  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

    Abstract: This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts to the original WaveGrad vocoder which conditi…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Proceedings of INTERSPEECH

  19. arXiv:2103.15060  [pdf, other]

    cs.CL cs.SD eess.AS

    PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

    Authors: Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

    Abstract: This paper introduces PnG BERT, a new encoder model for neural TTS. This model is augmented from the original BERT model, by taking both phoneme and grapheme representations of text as input, as well as the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned in a TTS task. Experimental results show that a neural TTS model usin…

    Submitted 7 June, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: Accepted to Interspeech 2021

  20. arXiv:2103.14574  [pdf, other]

    cs.SD eess.AS

    Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, RJ Skerry-Ryan, Yonghui Wu

    Abstract: This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model which does not require supervised duration signals. The duration model is based on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, this model can learn token-frame alignments as well as token durations automatica…

    Submitted 29 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  21. arXiv:2010.11439  [pdf, other]

    cs.SD eess.AS

    Parallel Tacotron: Non-Autoregressive and Controllable TTS

    Authors: Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu

    Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, a…

    Submitted 22 October, 2020; originally announced October 2020.

  22. arXiv:2010.04301  [pdf, other]

    cs.SD cs.CL

    Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

    Authors: Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga Zen, Yonghui Wu

    Abstract: This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of…

    Submitted 11 May, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

  23. arXiv:2009.00713  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveGrad: Estimating Gradients for Waveform Generation

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan

    Abstract: This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade infere…

    Submitted 9 October, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

  24. arXiv:2002.03788  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu

    Abstract: Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech,…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  25. arXiv:2002.03785  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

    Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

    Abstract: This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones. Additionally, it imposes hierarchical conditioning across all latent dimensions using a conditional variational auto-encoder (VAE) with a…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  26. arXiv:1907.04448  [pdf, other]

    cs.CL cs.SD eess.AS

    Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

    Authors: Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related…

    Submitted 24 July, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: 5 pages, submitted to Interspeech 2019

  27. arXiv:1904.02882  [pdf, other]

    cs.SD eess.AS

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Authors: Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, Yonghui Wu

    Abstract: This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than…

    Submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted for Interspeech 2019, 7 pages

  28. arXiv:1902.08295  [pdf, other]

    cs.LG stat.ML

    Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

    Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob, et al. (66 additional authors not shown)

    Abstract: Lingvo is a Tensorflow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly w…

    Submitted 21 February, 2019; originally announced February 2019.

  29. arXiv:1810.07217  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Hierarchical Generative Modeling for Controllable Speech Synthesis

    Authors: Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

    Abstract: This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarch…

    Submitted 27 December, 2018; v1 submitted 16 October, 2018; originally announced October 2018.

    Comments: 27 pages, accepted to ICLR 2019

  30. arXiv:1809.10460  [pdf, other]

    cs.LG cs.SD stat.ML

    Sample Efficient Adaptive Text-to-Speech

    Authors: Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, Caglar Gulcehre, Aäron van den Oord, Oriol Vinyals, Nando de Freitas

    Abstract: We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few…

    Submitted 16 January, 2019; v1 submitted 27 September, 2018; originally announced September 2018.

    Comments: Accepted by ICLR 2019

  31. Big Data and Privacy Issues for Connected Vehicles in Intelligent Transportation Systems

    Authors: Adnan Mahmood, Hushairi Zen, Shadi M. S. Hilles

    Abstract: The evolution of Big Data in large-scale Internet-of-Vehicles has brought forward unprecedented opportunities for a unified management of the transportation sector, and for devising smart Intelligent Transportation Systems. Nevertheless, such form of frequent heterogeneous data collection between the vehicles and numerous applications platforms via diverse radio access technologies has led to a nu…

    Submitted 7 June, 2018; originally announced June 2018.

    Journal ref: Mahmood A., Zen H., Hilles S.M.S. (2018) Big Data and Privacy Issues for Connected Vehicles in Intelligent Transportation Systems. In: Sakr S., Zomaya A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham

  32. arXiv:1712.05266  [pdf]

    eess.SP

    ICT Convergence in Internet of Things - The Birth of Smart Factories (A Technical Note)

    Authors: Mahmood Adnan, Hushairi Zen

    Abstract: Over the past decade, most factories across developed parts of the world employ a varying amount of the manufacturing technologies including autonomous robots, RFID (radio frequency identification) technology, NCs (numerically controlled machines), wireless sensor networks embedded with specialized computerized softwares for sophisticated product designs, engineering analysis, and remote control o…

    Submitted 29 November, 2017; originally announced December 2017.

    Journal ref: International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 4, April 2016, pp. 93

  33. arXiv:1711.10643  [pdf]

    cs.IT

    A Review on Cooperative Diversity Techniques Bypassing Channel Estimation

    Authors: Sylvia Ong Ai Ling, Hushairi Zen, Al-Khalid B Hj Othman, Mahmood Adnan, Olalekan Bello

    Abstract: Wireless communication technology has seen a remarkably fast evolution due to its capability to provide a quality, reliable and high-speed data transmission amongst the users. However, transmission of information in wireless channels is primarily impaired by deleterious multipath fading, which affects the quality and reliability of the system. In order to overcome the detrimental effects of fading…

    Submitted 28 November, 2017; originally announced November 2017.

    Journal ref: Canadian Journal of Pure and Applied Sciences, Vol. 10, No. 1, February 2016, pp.3777-3783

  34. arXiv:1711.10433  [pdf, other]

    cs.LG

    Parallel WaveNet: Fast High-Fidelity Speech Synthesis

    Authors: Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, Demis Hassabis

    Abstract: The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time p…

    Submitted 28 November, 2017; originally announced November 2017.

  35. arXiv:1609.03499  [pdf, other]

    cs.SD cs.LG

    WaveNet: A Generative Model for Raw Audio

    Authors: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu

    Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-…

    Submitted 19 September, 2016; v1 submitted 12 September, 2016; originally announced September 2016.

  36. arXiv:1606.06061  [pdf, other]

    cs.SD cs.CL

    Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

    Authors: Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemysław Szczepaniak

    Abstract: Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices; weight quantization, multi-frame infere…

    Submitted 22 June, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

    Comments: 13 pages, 3 figures, Interspeech 2016 (accepted)

  37. arXiv:1605.07809  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis

    Authors: Hideki Kawahara, Yannis Agiomyrgiannakis, Heiga Zen

    Abstract: This paper introduces a general and flexible framework for F0 and aperiodicity (additive non periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: instantaneous frequency estimator and initial aperiodicity detector, F0 trajectory tracker, and F0 refinement and aperiodicity extractor.…

    Submitted 22 July, 2016; v1 submitted 25 May, 2016; originally announced May 2016.

    Comments: Accepted for presentation in ISCA workshop SSW9

    Journal ref: 9th ISCA Speech Synthesis Workshop, 2016, pp.221-228

  38. arXiv:1509.00700  [pdf]

    physics.acc-ph

    Present status of source development station at UVSOR-III

    Authors: Najmeh Sadat Mirian, Jun-ichiro Yamazaki, Kenji Hayashi, Masahiro Katoh, Masahito Hosaka, Yoshifumi Takashima, Naoto Yamamoto, Taro Konomi, Heishun Zen

    Abstract: Construction and development of a source development station are in progress at UVSOR-III, a 750 MeV electron storage ring. It is equipped with an optical klystron type undulator system, a mode lock Ti:Sa Laser system, a dedicated beam-line for visible-VUV radiation and a parasitic beam-line for THz radiation. New light port to extract edge radiation was constructed recently. An optical cavity for…

    Submitted 2 September, 2015; originally announced September 2015.

    Comments: 3 pages, 3 figures

  39. arXiv:1310.7678  [pdf, ps, other]

    physics.optics

    Damage threshold and focusability of mid-infrared free-electron laser pulses gated by a plasma mirror with nanosecond switching pulses

    Authors: Xiaolong Wang, Takashi Nakajima, Heishun Zen, Toshiteru Kii, Hideaki Ohgaki

    Abstract: The presence of a pulse train structure of an oscillator-type free-electron laser (FEL) results in the immediate damage of a solid target upon focusing. We demonstrate that the laser-induced damage threshold can be significantly improved by gating the mid-infrared (MIR) FEL pulses with a plasma mirror. Although the switching pulses we employ have a nanosecond duration which does not guarantee the…

    Submitted 29 October, 2013; originally announced October 2013.

  40. arXiv:1308.6026  [pdf, ps, other]

    physics.optics

    Diagnosis of the Wavelength Stability of a Mid-Infrared Free-Electron Laser

    Authors: Xiaolong Wang, Yu Qin, Takashi Nakajima, Heishun Zen, Toshiteru Kii, Hideaki Ohgaki

    Abstract: Wavelength stability of free-electron lasers (FELs) is one of the important parameters for various applications. In this paper we describe two different methods to diagnose the wavelength stability of a mid-infrared (MIR) FEL. The first one is based on autocorrelation which is usually used to measure the pulse duration, and the second one is based on frequency upconversion through sum-frequency mi…

    Submitted 27 August, 2013; originally announced August 2013.

  41. arXiv:1211.2557  [pdf, ps, other]

    physics.optics

    Single-shot spectra of temporally selected micropulses from a mid-infrared free-electron laser by upconversion

    Authors: Xiaolong Wang, Takashi Nakajima, Heishun Zen, Toshiteru Kii, Hideaki Ohgaki

    Abstract: We demonstrate the measurement of single-shot spectra of temporally selected micropulses from a mid-infrared (MIR) free-electron laser (FEL) by upconversion. We achieve the upconversion of FEL pulses at 11 μm using externally synchronized Nd:YAG or microchip laser pulses at 1064 nm to produce sum-frequency mixing (SFM) signals at 970 nm, which are detected by a compact CCD spectrometer without an…

    Submitted 26 November, 2012; v1 submitted 12 November, 2012; originally announced November 2012.

    Comments: 3 pages, 4 figures