Search | arXiv e-print repository

arXiv:2011.03395 [pdf, other]

Underspecification Presents Challenges for Credibility in Modern Machine Learning

Authors: Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, et al. (15 additional authors not shown)

Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

Submitted 24 November, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Updates: Updated statistical analysis in Section 6; Additional citations

arXiv:2004.04894 [pdf]

doi 10.1016/j.eswa.2021.114809

Fully Automatic Electrocardiogram Classification System based on Generative Adversarial Network with Auxiliary Classifier

Authors: Zhanhong Zhou, Xiaolong Zhai, Chung Tin

Abstract: A generative adversarial network (GAN) based fully automatic electrocardiogram (ECG) arrhythmia classification system with high performance is presented in this paper. The generator (G) in our GAN is designed to generate various coupling matrix inputs conditioned on different arrhythmia classes for data augmentation. Our designed discriminator (D) is trained on both real and generated ECG coupling matrix inputs, and is extracted as an arrhythmia classifier upon completion of training for our GAN. After fine-tuning the D by including patient-specific normal beats estimated using an unsupervised algorithm, and generated abnormal beats by G that are usually rare to obtain, our fully automatic system showed superior overall classification performance for both supraventricular ectopic beats (SVEB or S beats) and ventricular ectopic beats (VEB or V beats) on the MIT-BIH arrhythmia database. It surpassed several state-of-art automatic classifiers and can perform on similar levels as some expert-assisted methods. In particular, the F1 score of SVEB has been improved by up to 13% over the top-performing automatic systems. Moreover, high sensitivity for both SVEB (87%) and VEB (93%) detection has been achieved, which is of great value for practical diagnosis. We, therefore, suggest our ACE-GAN (Generative Adversarial Network with Auxiliary Classifier for Electrocardiogram) based automatic system can be a promising and reliable tool for high throughput clinical screening practice, without any need of manual intervene or expert assisted labeling.

Submitted 4 March, 2021; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: Accepted for publication in Expert Systems with Applications

Journal ref: Expert Systems with Applications, Volume 174, 2021, 114809, ISSN 0957-4174

arXiv:1910.04867 [pdf, other]

A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Authors: Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby

Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?

Submitted 21 February, 2020; v1 submitted 1 October, 2019; originally announced October 2019.

arXiv:1908.10292 [pdf, other]

On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels

Authors: Tengyuan Liang, Alexander Rakhlin, Xiyu Zhai

Abstract: We study the risk of minimum-norm interpolants of data in Reproducing Kernel Hilbert Spaces. Our upper bounds on the risk are of a multiple-descent shape for the various scalings of $d = n^{\alpha}$, $\alpha \in (0,1)$, for the input dimension $d$ and sample size $n$. Empirical evidence supports our finding that minimum-norm interpolants in RKHS can exhibit this unusual non-monotonicity in sample size; furthermore, locations of the peaks in our experiments match our theoretical predictions. Since gradient flow on appropriately initialized wide neural networks converges to a minimum-norm interpolant with respect to a certain kernel, our analysis also yields novel estimation and generalization guarantees for these over-parametrized models. At the heart of our analysis is a study of spectral properties of the random kernel matrix restricted to a filtration of eigen-spaces of the population covariance operator, and may be of independent interest.

Submitted 3 February, 2020; v1 submitted 27 August, 2019; originally announced August 2019.

Journal ref: Proceedings of the 33rd Conference on Learning Theory 125 (2020) 2683-2711

arXiv:1906.11289

Near Optimal Stratified Sampling

Authors: Tiancheng Yu, Xiyu Zhai, Suvrit Sra

Abstract: The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show the proposed algorithms (to log-factors) are rate optimal. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.

Submitted 26 July, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

Comments: We have discovered a mistake in the main result. The quantity on the RHS of (3) is not equal to the variance of estimator (2) when the sampling rule is designed adaptively as we do. There will be further cross-product terms which are now dominant terms. Therefore, although our bound is correct for (3), it no longer implies bound of the variance of (2)

arXiv:1903.02271 [pdf, other]

High-Fidelity Image Generation With Fewer Labels

Authors: Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, Sylvain Gelly

Abstract: Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on both unsupervised ImageNet synthesis, as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels.

Submitted 14 May, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

Comments: Mario Lucic, Michael Tschannen, and Marvin Ritter contributed equally to this work. ICML 2019 camera-ready version. Code available at https://github.com/google/compare_gan

arXiv:1812.11167 [pdf, ps, other]

Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon

Authors: Alexander Rakhlin, Xiyu Zhai

Abstract: We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. The lower bound holds for any choice of kernel bandwidth, even if selected based on data. The result supports the empirical observation that minimum-norm interpolation (that is, exact fit to training data) in RKHS generalizes well for some high-dimensional datasets, but not for low-dimensional ones.

Submitted 28 December, 2018; originally announced December 2018.

arXiv:1811.11212 [pdf, other]

Self-Supervised GANs via Auxiliary Rotation Loss

Authors: Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby

Abstract: Conditional GANs are at the forefront of natural image synthesis. The main drawback of such models is the necessity for labeled data. In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, and take a step towards bridging the gap between conditional and unconditional GANs. In particular, we allow the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game. The role of self-supervision is to encourage the discriminator to learn meaningful feature representations which are not forgotten during training. We test empirically both the quality of the learned image representations, and the quality of the synthesized images. Under the same conditions, the self-supervised GAN attains a similar performance to state-of-the-art conditional counterparts. Finally, we show that this approach to fully unsupervised learning can be scaled to attain an FID of 23.4 on unconditional ImageNet generation.

Submitted 9 April, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

arXiv:1811.03804 [pdf, ps, other]

Gradient Descent Finds Global Minima of Deep Neural Networks

Authors: Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

Abstract: Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.

Submitted 28 May, 2019; v1 submitted 9 November, 2018; originally announced November 2018.

Comments: ICML 2019

arXiv:1810.11598 [pdf, other]

Self-Supervised GAN to Counter Forgetting

Authors: Ting Chen, Xiaohua Zhai, Neil Houlsby

Abstract: GANs involve training two networks in an adversarial game, where each network's task depends on its adversary. Recently, several works have framed GAN training as an online or continual learning problem. We focus on the discriminator, which must perform classification under an (adversarially) shifting data distribution. When trained on sequential tasks, neural networks exhibit forgetting. For GANs, discriminator forgetting leads to training instability. To counter forgetting, we encourage the discriminator to maintain useful representations by adding a self-supervision. Conditional GANs have a similar effect using labels. However, our self-supervised GAN does not require labels, and closes the performance gap between conditional and unconditional models. We show that, in doing so, the self-supervised discriminator learns better representations than regular GANs.

Submitted 29 November, 2018; v1 submitted 27 October, 2018; originally announced October 2018.

Comments: NeurIPS'18 Continual Learning workshop

arXiv:1810.02054 [pdf, other]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Authors: Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh

Abstract: One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

Submitted 4 February, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: ICLR 2019

arXiv:1807.04720 [pdf, other]

A Large-Scale Study on Regularization and Normalization in GANs

Authors: Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, Sylvain Gelly

Abstract: Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they were successfully applied to many problems, training a GAN is a notoriously challenging task and requires a significant number of hyperparameter tuning, neural architecture engineering, and a non-trivial amount of "tricks". The success in many practical applications coupled with the lack of a measure to quantify the failure modes of GANs resulted in a plethora of proposed losses, regularization and normalization schemes, as well as neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We discuss and evaluate common pitfalls and reproducibility issues, open-source our code on Github, and provide pre-trained models on TensorFlow Hub.

Submitted 14 May, 2019; v1 submitted 12 July, 2018; originally announced July 2018.

Comments: Revision accepted to ICML'19: More focus on regularization and normalization aspects. Added recent references and promising future directions

arXiv:1805.07883 [pdf, other]

How Many Samples are Needed to Estimate a Convolutional or Recurrent Neural Network?

Authors: Simon S. Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Aarti Singh

Abstract: It is widely believed that the practical success of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) owes to the fact that CNNs and RNNs use a more compact parametric representation than their Fully-Connected Neural Network (FNN) counterparts, and consequently require fewer training examples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample-complexity of estimating CNNs and RNNs. We show that the sample-complexity to learn CNNs and RNNs scales linearly with their intrinsic dimension and this sample-complexity is much smaller than for their FNN counterparts. For both CNNs and RNNs, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main technical tools for deriving these results are a localized empirical process analysis and a new technical lemma characterizing the convolutional and recurrent structure. We believe that these tools may inspire further developments in understanding CNNs and RNNs.

Submitted 29 June, 2019; v1 submitted 20 May, 2018; originally announced May 2018.

Comments: Revised version, with new results on recurrent neural networks. Preliminary version in NeurIPS 2018

arXiv:1707.05947 [pdf, other]

Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints

Authors: Wenlong Mou, Liwei Wang, Xiyu Zhai, Kai Zheng

Abstract: Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impacts of stochastic gradient methods on generalization error for non-convex learning problems not only have important theoretical consequences, but are also critical to generalization errors of deep learning. In this paper, we study the generalization errors of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are proposed with non-asymptotic discrete-time analysis, using Stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is uniform Lipschitz parameter, $\beta$ is inverse temperature, and $T_k$ is aggregated step sizes. For PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step is shown with an exponentially decaying factor by imposing $\ell^2$ regularization, and the uniform Lipschitz constant is also replaced by actual norms of gradients along trajectory. Our bounds have no implicit dependence on dimensions, norms or other capacity measures of parameter, which elegantly characterizes the phenomenon of "Fast Training Guarantees Generalization" in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and has important implications to statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.

Submitted 19 July, 2017; originally announced July 2017.

Showing 1–14 of 14 results for author: Zhai, X