-
Resource-Efficient Neural Architect
Authors:
Yanqi Zhou,
Siavash Ebrahimi,
Sercan Ö. Arık,
Haonan Yu,
Hairong Liu,
Greg Diamos
Abstract:
Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS mainly targets improving accuracy but gives little consideration to computational resource use. We propose the Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS using reinforcement learning with network embedding. RENA uses a policy network to process the network embeddings to generate new configurations. We demonstrate RENA on image recognition and keyword spotting (KWS) problems. RENA can find novel architectures that achieve high performance even with tight resource constraints. For CIFAR10, it achieves 2.95% test error when compute intensity is greater than 100 FLOPs/byte, and 3.87% test error when model size is less than 3M parameters. For the Google Speech Commands Dataset, RENA achieves state-of-the-art accuracy without resource constraints, and it outperforms the optimized architectures with tight resource constraints.
Submitted 12 June, 2018;
originally announced June 2018.
-
Neural Voice Cloning with a Few Samples
Authors:
Sercan O. Arik,
Jitong Chen,
Kainan Peng,
Wei Ping,
Yanqi Zhou
Abstract:
Voice cloning is a highly desired feature for personalized speech interfaces. Neural network based speech synthesis has been shown to generate high quality speech for a large number of speakers. In this paper, we introduce a neural voice cloning system that takes a few audio samples as input. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding is based on training a separate model to directly infer a new speaker embedding from cloning audios and to be used with a multi-speaker generative model. In terms of naturalness of the speech and its similarity to original speaker, both approaches can achieve good performance, even with very few cloning audios. While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
Submitted 12 October, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Authors:
Wei Ping,
Kainan Peng,
Andrew Gibiansky,
Sercan O. Arik,
Ajay Kannan,
Sharan Narang,
Jonathan Raiman,
John Miller
Abstract:
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
Submitted 22 February, 2018; v1 submitted 20 October, 2017;
originally announced October 2017.
-
Low-complexity implementation of convex optimization-based phase retrieval
Authors:
Sercan O. Arik,
Joseph M. Kahn
Abstract:
Phase retrieval has important applications in optical imaging, communications and sensing. Lifting the dimensionality of the problem allows phase retrieval to be approximated as a convex optimization problem in a higher-dimensional space. Convex optimization-based phase retrieval has been shown to yield high accuracy, yet its low-complexity implementation has not been explored. In this paper, we study three fundamental approaches for its low-complexity implementation: the projected gradient method, the Nesterov accelerated gradient method, and the alternating direction method of multipliers (ADMM) method. We derive the corresponding estimation algorithms and evaluate their complexities. We compare their performance in the application area of direct-detection mode-division multiplexing. We demonstrate that they yield negligible estimation penalties (less than 0.2 dB for transmitter processing and less than 0.6 dB for receiver equalization) while yielding low computational cost, as their implementation complexities all scale quadratically in the number of unknown parameters. Among the three methods, ADMM achieves convergence after the smallest number of iterations.
Submitted 19 March, 2018; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Authors:
Sercan Arik,
Gregory Diamos,
Andrew Gibiansky,
John Miller,
Kainan Peng,
Wei Ping,
Jonathan Raiman,
Yanqi Zhou
Abstract:
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Submitted 20 September, 2017; v1 submitted 24 May, 2017;
originally announced May 2017.
-
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
Authors:
Sercan O. Arik,
Markus Kliegl,
Rewon Child,
Joel Hestness,
Andrew Gibiansky,
Chris Fougner,
Ryan Prenger,
Adam Coates
Abstract:
Keyword spotting (KWS) constitutes a major component of human-technology interfaces. Maximizing the detection accuracy at a low false alarm (FA) rate, while minimizing the footprint size, latency and complexity are the goals for KWS. Towards achieving them, we study Convolutional Recurrent Neural Networks (CRNNs). Inspired by large-scale state-of-the-art speech recognition systems, we combine the strengths of convolutional layers and recurrent layers to exploit local structure and long-range context. We analyze the effect of architecture parameters, and propose training strategies to improve performance. With only ~230k parameters, our CRNN model yields acceptably low latency, and achieves 97.71% accuracy at 0.5 FA/hour for 5 dB signal-to-noise ratio.
Submitted 4 July, 2017; v1 submitted 15 March, 2017;
originally announced March 2017.
-
Deep Voice: Real-time Neural Text-to-Speech
Authors:
Sercan O. Arik,
Mike Chrzanowski,
Adam Coates,
Gregory Diamos,
Andrew Gibiansky,
Yongguo Kang,
Xian Li,
John Miller,
Andrew Ng,
Jonathan Raiman,
Shubho Sengupta,
Mohammad Shoeybi
Abstract:
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. Finally, we show that inference with our system can be performed faster than real time and describe optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Submitted 7 March, 2017; v1 submitted 24 February, 2017;
originally announced February 2017.
-
Supervised classification-based stock prediction and portfolio optimization
Authors:
Sercan Arik,
Sukru Burc Eryilmaz,
Adam Goldberg
Abstract:
As the number of publicly traded companies as well as the amount of their financial data grows rapidly, it is highly desired to have tracking, analysis, and eventually stock selections automated. There have been a few works focusing on estimating the stock prices of individual companies. However, many of those have worked with a very small number of financial parameters. In this work, we apply machine learning techniques to address automated stock picking, while using a larger number of financial parameters for individual companies than the previous studies. Our approaches are based on the supervision of prediction parameters using company fundamentals, time-series properties, and correlation information between different stocks. We examine a variety of supervised learning techniques and find that using stock fundamentals is a useful approach for the classification problem, when combined with the high dimensional data handling capabilities of support vector machines. The portfolio our system suggests by predicting the behavior of stocks results in a 3% larger growth on average than the overall market within a 3-month time period, as the out-of-sample test suggests.
Submitted 3 June, 2014;
originally announced June 2014.