Search | arXiv e-print repository

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han

Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music

Submitted 6 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: 15 pages

arXiv:2210.15897 [pdf, other]

Single-Image HDR Reconstruction by Multi-Exposure Generation

Authors: Phuoc-Hieu Le, Quynh Le, Rang Nguyen, Binh-Son Hua

Abstract: High dynamic range (HDR) imaging is an indispensable technique in modern photography. Traditional methods focus on HDR reconstruction from multiple images, solving the core problems of image alignment, fusion, and tone mapping, yet still lack a perfect solution due to ghosting and other visual artifacts in the reconstruction. Recent attempts at single-image HDR reconstruction show a promising alternative: by learning to map pixel values to their irradiance using a neural network, one can bypass the align-and-merge pipeline completely yet still obtain a high-quality HDR image. In this work, we propose a weakly supervised learning method that inverts the physical image formation process for HDR reconstruction via learning to generate multiple exposures from a single image. Our neural network can invert the camera response to reconstruct pixel irradiance before synthesizing multiple exposures and hallucinating details in under- and over-exposed regions from a single input image. To train the network, we propose a representation loss, a reconstruction loss, and a perceptual loss applied on pairs of under- and over-exposure images and thus do not require HDR images for training. Our experiments show that our proposed model can effectively reconstruct HDR images. Our qualitative and quantitative results show that our method achieves state-of-the-art performance on the DrTMO dataset. Our code is available at https://github.com/VinAIResearch/single_image_hdr.

Submitted 28 October, 2022; originally announced October 2022.

Comments: WACV 2023 paper. 8 pages of content, 2 pages of references, 8 pages of supplementary material

arXiv:2210.10879 [pdf, other]

G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as directed acyclic graphs (DAGs) and search over this space to optimize the augmentation policy itself. We show that given the same computational budget, policies produced by G-Augment are able to perform better than SpecAugment policies obtained by random search on fine-tuning tasks on CHiME-6 and AMI. G-Augment is also able to establish a new state-of-the-art ASR performance on the CHiME-6 evaluation set (30.7% WER). We further demonstrate that G-Augment policies show better transfer properties across warm-start to cold-start training and model size compared to random-searched SpecAugment policies.

Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

Comments: 6 pages, accepted at SLT 2022. Updated with copyright

arXiv:2202.12430 [pdf]

Koopman Spectral Analysis of Intermittent Dynamics in Complex Systems: A Case Study in Pathophysiological Processes of Obstructive Sleep Apnea

Authors: Phat K. Huynh, Arveity R. Setty, Trung Q. Le

Abstract: Complex systems, such as pathophysiological processes, commonly exhibit chaotic, nonlinear, and intermittent phenomena. Koopman operator theory and the Hankel alternative view of Koopman (HAVOK) model have been widely used to decompose the chaos of complex system dynamics into an intermittently forced linear system. Although the statistics of the intermittent forcing have been proposed to characterize intermittencies in the HAVOK model, they were not adequate to account for the mode switching of nonlinear dynamics and the fat-tailed non-Gaussian distribution originating from high-frequency bursts and rarely observed intermittent forcing. The paper proposed a new intermittency dynamics analysis approach to characterize the intermittent phases, chaotic bursts, and local spectral-temporal properties of various intermittent dynamics modes using spectral decomposition and wavelet analysis. To validate our methods, the intermittency behavior of apneic events in obstructive sleep apnea (OSA) disorder was selected as the case, in which heart rate variability (HRV) features were extracted. Next, we constructed the Hankel matrix from the HRV features and obtained the last eigen time-delay coordinate by singular value decomposition of the Hankel matrix, which was modeled as an intermittent forcing input. The statistics of the forcing in OSA demonstrated the fat-tailed distribution of the intermittent forcing, which corresponds to the intermittency of the underlying OSA pathophysiological process. The pooled means and standard deviations of the burst duration and the inter-burst duration across OSA patients were also calculated to be minutes and minutes. Scalogram amplitude and spectral decomposition of the wavelet transform exhibited various predominant frequencies and dynamics modes associated with apneic events.

Submitted 24 February, 2022; originally announced February 2022.

Comments: 28 pages, 9 figures, 1 table

arXiv:2111.05761 [pdf]

A Probabilistic Domain-knowledge Framework for Nosocomial Infection Risk Estimation of Communicable Viral Diseases in Healthcare Personnel: A Case Study for COVID-19

Authors: Phat K. Huynh, Arveity R. Setty, Om P. Yadav, Trung Q. Le

Abstract: Hospital-acquired infections of communicable viral diseases (CVDs) are posing a tremendous challenge to healthcare workers globally. Healthcare personnel (HCP) face a consistent risk of hospital-acquired infections, and subsequently higher rates of morbidity and mortality. We proposed a domain knowledge-driven infection risk model to quantify the individual HCP and the population-level healthcare facility risks. For individual-level risk estimation, a time-variant infection risk model is proposed to capture the transmission dynamics of CVDs. At the population level, the infection risk is estimated using a Bayesian network model constructed from three feature sets including individual-level factors, engineering control factors, and administrative control factors. The sensitivity analyses indicated that the uncertainty in the individual infection risk can be attributed to two variables: the number of close contacts and the viral transmission probability. The model validation was implemented in the transmission probability model, individual-level risk model, and population-level risk model using a Coronavirus disease case study. Regarding the first, multivariate logistic regression was applied to cross-sectional data from the UK with an AIC value of 7317.70 and a 10-fold cross-validation accuracy of 78.23%. For the second model, we collected laboratory-confirmed COVID-19 cases of HCP in different occupations. The occupation-specific risk evaluation suggested the highest-risk occupations were registered nurses, medical assistants, and respiratory therapists, with estimated risks of 0.0189, 0.0188, and 0.0176, respectively. To validate the population-level risk model, the infection risk in Texas and California was estimated. The proposed model will significantly influence the PPE allocation and safety plans for HCP.

Submitted 4 November, 2021; originally announced November 2021.

Comments: 10 pages, 4 figures, Journal of Biomedical and Health Informatics

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2102.05610 [pdf, other]

Searching for Fast Model Families on Datacenter Accelerators

Authors: Sheng Li, Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le, Norman P. Jouppi

Abstract: Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. On top of our DC-accelerator-optimized neural architecture search space, we further propose latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. Our LACS discovers that network depth should grow much faster than image size and network width, which is quite different from previous compound scaling results. With the new search space and LACS, our search and scaling on datacenter accelerators result in a new model series named EfficientNet-X. EfficientNet-X is up to more than 2X faster than EfficientNet (a model series with state-of-the-art trade-off on FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy. EfficientNet-X is also up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100.

Submitted 10 February, 2021; originally announced February 2021.

arXiv:2012.11736 [pdf, ps, other]

Energy Efficiency Maximization in RIS-Aided Cell-Free Network with Limited Backhaul

Authors: Quang Nhat Le, Van-Dinh Nguyen, Octavia A. Dobre, Ruiqin Zhao

Abstract: Integrating the reconfigurable intelligent surface in a cell-free (RIS-CF) network is an effective solution to improve the capacity and coverage of future wireless systems with low cost and power consumption. The reflecting coefficients of RISs can be programmed to enhance signals received at users. This letter addresses a joint design of transmit beamformers at access points and reflecting coefficients at RISs to maximize the energy efficiency (EE) of RIS-CF networks, taking into account the limited backhaul capacity constraints. Because the resulting problem is nonconvex and computationally very challenging, we develop a simple yet efficient alternating descent algorithm for its solution. Numerical results verify that the EE of RIS-CF networks is greatly improved, showing the benefit of using RISs.

Submitted 8 March, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

Comments: submitted for possible publication

arXiv:2011.10133 [pdf, ps, other]

doi 10.1109/TGCN.2020.3036026

Full-Duplex Non-Orthogonal Multiple Access Cooperative Overlay Spectrum-Sharing Networks with SWIPT

Authors: Quang Nhat Le, Animesh Yadav, Nam-Phong Nguyen, Octavia A. Dobre, Ruiqin Zhao

Abstract: This paper proposes a novel non-orthogonal multiple access (NOMA) assisted cooperative spectrum sharing network, in which one of the full-duplex (FD) secondary transmitters (STs) is chosen among many for forwarding the primary transmitter's and its own information to the primary receiver and secondary receivers, respectively, using the NOMA technique. To stimulate the ST to conduct cooperative transmission and sustain its operations, the simultaneous wireless information and power transfer (SWIPT) technique is utilized by the ST to harvest the primary signal's energy. In order to evaluate the proposed system's performance, the outage probability and system throughput for the primary and secondary networks are derived in tight closed-form approximations. Further, the sum rate optimization problem is formulated for the proposed cooperative network and a rapidly convergent iterative algorithm is proposed to obtain the optimized power allocation coefficients. Numerical results show that FD, SWIPT, and NOMA techniques greatly boost the performance of the cooperative spectrum-sharing network in terms of outage probability, system throughput, and sum rate compared to that of half-duplex NOMA and the conventional orthogonal multiple access-time division multiple access networks.

Submitted 19 November, 2020; originally announced November 2020.

Comments: accepted for publication in the IEEE Transactions on Green Communications and Networking

arXiv:2011.07549 [pdf, ps, other]

Learning-Assisted User Clustering in Cell-Free Massive MIMO-NOMA Networks

Authors: Quang Nhat Le, Van-Dinh Nguyen, Nam-Phong Nguyen, Symeon Chatzinotas, Octavia A. Dobre, Ruiqin Zhao

Abstract: The superior spectral efficiency (SE) and user fairness feature of non-orthogonal multiple access (NOMA) systems are achieved by exploiting user clustering (UC) more efficiently. However, a random UC certainly results in a suboptimal solution while an exhaustive search method comes at the cost of high complexity, especially for systems of medium-to-large size. To address this problem, we develop two efficient unsupervised machine learning (ML) based UC algorithms, namely k-means++ and improved k-means++, to effectively cluster users into disjoint clusters in a cell-free massive multiple-input multiple-output (CFmMIMO) system. Using full-pilot zero-forcing at access points, we derive the sum SE in a closed-form expression taking into account the impact of intra-cluster pilot contamination, inter-cluster interference, and imperfect successive interference cancellation. To comprehensively assess the system performance, we formulate the sum SE optimization problem, and then develop a simple yet efficient iterative algorithm for its solution. In addition, the performance of the collocated massive MIMO-NOMA (COmMIMO-NOMA) system is also characterized. Numerical results are provided to show the superior performance of the proposed UC algorithms compared to other baseline schemes. The effectiveness of applying NOMA in CFmMIMO and COmMIMO systems is also validated.
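As a rough illustration of the k-means++ idea named in this abstract, the sketch below shows only the standard seeding step (each new centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen), followed by a nearest-centroid assignment. It is not the authors' algorithm or their "improved" variant. The function names and the use of 2-D user coordinates as clustering features are assumptions made for this example; the abstract does not say which features are clustered.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """Pick k initial centroids using standard k-means++ (D^2-weighted) seeding."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a centroid; fall back to uniform choice.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid (disjoint clusters)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


# Hypothetical user positions (illustrative values only).
users = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.3)]
seeds = kmeanspp_seeds(users, 3, random.Random(0))
labels = assign_clusters(users, seeds)
```

A full k-means run would alternate the assignment step with centroid updates until the labels stop changing; the seeding shown here only affects where that iteration starts.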

Submitted 15 November, 2020; originally announced November 2020.

Comments: submitted for possible publication

arXiv:2010.10504 [pdf, other]

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%.

Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

arXiv:2008.06828 [pdf, other]

A novel approach to remove foreign objects from chest X-ray images

Authors: Hieu X. Le, Phuong D. Nguyen, Thang H. Nguyen, Khanh N. Q. Le, Thanh T. Nguyen

Abstract: We initially proposed a deep learning approach for foreign objects inpainting in smartphone-camera captured chest radiographs utilizing the cheXphoto dataset. Foreign objects, which can significantly affect the quality of a computer-aided diagnostic prediction, are captured under various settings. In this paper, we combine several methods to tackle both the removal of foreign objects and the inpainting of chest radiographs. Firstly, an object detection model is trained to separate the foreign objects from the given image. Subsequently, the binary mask of each object is extracted utilizing a segmentation model. Each pair of the binary mask and the extracted object is then used for inpainting purposes. Finally, the in-painted regions are merged back into the original image, resulting in a clean output free of foreign objects. To conclude, we achieved state-of-the-art accuracy. The experimental results showed a new approach to the possible applications of this method for chest X-ray images detection.

Submitted 15 August, 2020; originally announced August 2020.

Comments: 9 pages, 7 figures, 7 tables

arXiv:2005.09629 [pdf, other]

doi 10.21437/Interspeech.2020-1470

Improved Noisy Student Training for Automatic Speech Recognition

Authors: Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

Abstract: Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word error rates (WERs) 4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).

Submitted 29 October, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added

Journal ref: Proc. Interspeech 2020, 2817-2821

arXiv:1912.05533 [pdf, ps, other]

SpecAugment on Large Scale Datasets

Authors: Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, Yonghui Wu

Abstract: Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has been shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Narayanan et al., 2018). We achieve improvement across all test domains by mixing raw training data augmented with SpecAugment and noise-perturbed training data when training the acoustic model. We also introduce a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks. By using adaptive masking, we are able to further improve the performance of the Listen, Attend and Spell model on LibriSpeech to 2.2% WER on test-clean and 5.2% WER on test-other.
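The adaptive masking idea described in this abstract (time mask size and/or multiplicity that depend on utterance length) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the proportional rule, and the default values `p_mult`, `p_size`, and `max_masks` are all assumptions chosen for the example.

```python
import random


def adaptive_time_masks(num_frames, p_mult=0.04, p_size=0.04, max_masks=20, rng=None):
    """Sample time masks whose count and maximum width scale with utterance length.

    All default values are illustrative assumptions, not values from the abstract.
    """
    rng = rng or random.Random()
    n_masks = min(max_masks, int(p_mult * num_frames))  # adaptive multiplicity
    max_width = int(p_size * num_frames)                # adaptive mask size
    masks = []
    for _ in range(n_masks):
        width = rng.randint(0, max_width)
        start = rng.randint(0, num_frames - width)
        masks.append((start, start + width))            # half-open interval [start, end)
    return masks


def apply_time_masks(spectrogram, masks, fill=0.0):
    """Return a copy of a frames-by-bins spectrogram with masked frames overwritten."""
    out = [list(frame) for frame in spectrogram]
    for start, end in masks:
        for t in range(start, end):
            out[t] = [fill] * len(out[t])
    return out


# A hypothetical 200-frame, 4-bin spectrogram filled with ones.
spec = [[1.0] * 4 for _ in range(200)]
masked = apply_time_masks(spec, adaptive_time_masks(200, rng=random.Random(0)))
```

With a fixed policy, a short utterance and a long one would receive the same number and width of masks; scaling both with `num_frames` keeps the masked fraction roughly comparable across lengths.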

Submitted 11 December, 2019; originally announced December 2019.

Comments: 5 pages, 3 tables; submitted to ICASSP 2020

arXiv:1912.05027 [pdf, other]

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

Authors: Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song

Abstract: Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue that the encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.5% AP with a MaskR-CNN detector and achieves 52.1% AP with a RetinaNet detector on COCO for a single model without test-time augmentation, significantly outperforming prior detectors. SpineNet can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset. Code is at: https://github.com/tensorflow/tpu/tree/master/models/official/detection.

Submitted 17 June, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: CVPR 2020

arXiv:1911.09070 [pdf, other]

EfficientDet: Scalable and Efficient Object Detection

Authors: Mingxing Tan, Ruoming Pang, Quoc V. Le

Abstract: Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multiscale feature fusion; second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and better backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single model and single-scale, our EfficientDet-D7 achieves state-of-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs, being 4x - 9x smaller and using 13x - 42x fewer FLOPs than previous detectors. Code is available at https://github.com/google/automl/tree/master/efficientdet.

Submitted 27 July, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: CVPR 2020

Journal ref: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)

arXiv:1910.04971 [pdf, other]

Autonomous Shuttles for Last-Mile Connectivity

Authors: Garrison Neel, Amir Darwesh, Quang Le, Srikanth Saripalli

Abstract: This paper describes an autonomous shuttle which targets providing last-mile transportation. Often, this involves operation in crowded areas with high levels of pedestrian traffic, and little to no lane markings or traffic control. We aim to create a functional shuttle to be improved upon in the future as new robust solutions are developed to replace the current components. An initial implementation of such a shuttle is presented, detailing the overall architecture, controller structure, waypoint following, obstacle detection and avoidance, LiDAR based sign detection, and pedestrian communication. The performance of each component is evaluated, and future improvements are discussed.

Submitted 11 October, 2019; originally announced October 2019.

arXiv:1906.02940 [pdf, other]

Selfie: Self-supervised Pretraining for Image Embedding

Authors: Trieu H. Trinh, Minh-Thang Luong, Quoc V. Le

Abstract: We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture of Selfie includes a network of convolutional blocks to process patches followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet 32 x 32, and ImageNet 224 x 224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to the standard supervised training of the same network. Notably, on ImageNet 224 x 224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially in the low-data regime, by significantly lowering the standard deviation of test accuracies across different runs.

Submitted 27 July, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

arXiv:1904.08779 [pdf, other]

doi 10.21437/Interspeech.2019-2680

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Authors: Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.

Submitted 3 December, 2019; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: 5 pages, 3 figures, 6 tables; v3: references added

Journal ref: Proc. Interspeech 2019, 2613-2617

Showing 1–19 of 19 results for author: Le, Q