-
Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels
Authors:
Ke Wang,
Guillermo Ortiz-Jimenez,
Rodolphe Jenatton,
Mark Collier,
Efi Kokiopoulou,
Pascal Frossard
Abstract:
Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.
Submitted 28 May, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
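As a rough illustration of the logit decomposition described in the abstract above, the sketch below blends a prediction term and a noise-fitting term through a gate. The convex-combination form, the sigmoid gate, the function names, and all numbers are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real-valued gate score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def combine_logits(prediction_logits, noise_logits, gate_score):
    """Blend the two terms: (1 - gate) * prediction + gate * noise.

    In the setting described above, gate_score and noise_logits would be
    computed from PI only, and prediction_logits from the regular input.
    """
    gate = sigmoid(gate_score)
    return [(1.0 - gate) * p + gate * n
            for p, n in zip(prediction_logits, noise_logits)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs for one training example with 3 classes.
prediction_logits = [4.0, 0.5, -1.0]  # from input features: favors class 0
noise_logits = [-2.0, 5.0, 0.0]       # from PI only: favors the given label 1

# Gate near 0: the example is treated as clean, the prediction term dominates.
clean_like = combine_logits(prediction_logits, noise_logits, gate_score=-6.0)
# Gate near 1: the example is treated as noisy, the noise term absorbs the label.
noisy_like = combine_logits(prediction_logits, noise_logits, gate_score=6.0)

print(argmax(clean_like), argmax(noisy_like))
```

Since PI is unavailable at test time, a model of this kind would presumably predict from the prediction term alone, and the learned gate value could serve as a per-example noise indicator after training.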
-
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Authors:
Jannik Kossen,
Mark Collier,
Basil Mustafa,
Xiao Wang,
Xiaohua Zhai,
Lucas Beyer,
Andreas Steiner,
Jesse Berent,
Rodolphe Jenatton,
Efi Kokiopoulou
Abstract:
We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits from training the image tower contrastively. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
Submitted 30 October, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
When does Privileged Information Explain Away Label Noise?
Authors:
Guillermo Ortiz-Jimenez,
Mark Collier,
Anant Nawalgaria,
Alexander D'Amour,
Jesse Berent,
Rodolphe Jenatton,
Effrosyni Kokiopoulou
Abstract:
Leveraging privileged information (PI), or features available during training but not at test time, has recently been shown to be an effective method for addressing label noise. However, the reasons for its effectiveness are not well understood. In this study, we investigate the role played by different properties of the PI in explaining away label noise. Through experiments on multiple datasets with real PI (CIFAR-N/H) and a new large-scale benchmark ImageNet-PI, we find that PI is most helpful when it allows networks to easily distinguish clean from noisy data, while enabling a learning shortcut to memorize the noisy examples. Interestingly, when PI becomes too predictive of the target label, PI methods often perform worse than their no-PI baselines. Based on these findings, we propose several enhancements to the state-of-the-art PI methods and demonstrate the potential of PI as a means of tackling label noise. Finally, we show how we can easily combine the resulting PI approaches with existing no-PI techniques designed to deal with label noise.
Submitted 1 June, 2023; v1 submitted 3 March, 2023;
originally announced March 2023.
-
Scaling Vision Transformers to 22 Billion Parameters
Authors:
Mostafa Dehghani,
Josip Djolonga,
Basil Mustafa,
Piotr Padlewski,
Jonathan Heek,
Justin Gilmer,
Andreas Steiner,
Mathilde Caron,
Robert Geirhos,
Ibrahim Alabdulmohsin,
Rodolphe Jenatton,
Lucas Beyer,
Michael Tschannen,
Anurag Arnab,
Xiao Wang,
Carlos Riquelme,
Matthias Minderer,
Joan Puigcerver,
Utku Evci,
Manoj Kumar,
Sjoerd van Steenkiste,
Gamaleldin F. Elsayed,
Aravindh Mahendran,
Fisher Yu,
Avital Oliver
, et al. (17 additional authors not shown)
Abstract:
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
Submitted 10 February, 2023;
originally announced February 2023.
-
Massively Scaling Heteroscedastic Classifiers
Authors:
Mark Collier,
Rodolphe Jenatton,
Basil Mustafa,
Neil Houlsby,
Jesse Berent,
Effrosyni Kokiopoulou
Abstract:
Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count when compared to a standard classifier scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter, by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.
Submitted 30 January, 2023;
originally announced January 2023.
-
On the Adversarial Robustness of Mixture of Experts
Authors:
Joan Puigcerver,
Rodolphe Jenatton,
Carlos Riquelme,
Pranjal Awasthi,
Srinadh Bhojanapalli
Abstract:
Adversarial robustness is a key desirable property of neural networks. It has been empirically shown to be affected by their sizes, with larger networks being typically more robust. Recently, Bubeck and Sellke proved a lower bound on the Lipschitz constant of functions that fit the training data in terms of their number of parameters. This raises an interesting open question, do -- and can -- functions with more parameters, but not necessarily more computational cost, have better robustness? We study this question for sparse Mixture of Expert models (MoEs), that make it possible to scale up the model size for a roughly constant computational cost. We theoretically show that under certain conditions on the routing and the structure of the data, MoEs can have significantly smaller Lipschitz constants than their dense counterparts. The robustness of MoEs can suffer when the highest weighted experts for an input implement sufficiently different functions. We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show they are indeed more robust than dense models with the same computational cost. We make key observations showing the robustness of MoEs to the choice of experts, highlighting the redundancy of experts in models trained in practice.
Submitted 18 October, 2022;
originally announced October 2022.
-
Plex: Towards Reliability using Pretrained Large Model Extensions
Authors:
Dustin Tran,
Jeremiah Liu,
Michael W. Dusenberry,
Du Phan,
Mark Collier,
Jie Ren,
Kehang Han,
Zi Wang,
Zelda Mariet,
Huiyi Hu,
Neil Band,
Tim G. J. Rudner,
Karan Singhal,
Zachary Nado,
Joost van Amersfoort,
Andreas Kirsch,
Rodolphe Jenatton,
Nithum Thain,
Honglin Yuan,
Kelly Buchanan,
Kevin Murphy,
D. Sculley,
Yarin Gal,
Zoubin Ghahramani,
Jasper Snoek
, et al. (1 additional authors not shown)
Abstract:
A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol as it improves the out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.
Submitted 15 July, 2022;
originally announced July 2022.
-
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Authors:
Basil Mustafa,
Carlos Riquelme,
Joan Puigcerver,
Rodolphe Jenatton,
Neil Houlsby
Abstract:
Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
Submitted 6 June, 2022;
originally announced June 2022.
-
Transfer and Marginalize: Explaining Away Label Noise with Privileged Information
Authors:
Mark Collier,
Rodolphe Jenatton,
Efi Kokiopoulou,
Jesse Berent
Abstract:
Supervised learning datasets often have privileged information, in the form of features which are available at training time but are not available at test time e.g. the ID of the annotator that provided the label. We argue that privileged information is useful for explaining away label noise, thereby reducing the harmful impact of noisy labels. We develop a simple and efficient method for supervised learning with neural networks: it transfers via weight sharing the knowledge learned with privileged information and approximately marginalizes over privileged information at test time. Our method, TRAM (TRansfer and Marginalize), has minimal training time overhead and has the same test-time cost as not using privileged information. TRAM performs strongly on CIFAR-10H, ImageNet and Civil Comments benchmarks.
Submitted 15 June, 2022; v1 submitted 18 February, 2022;
originally announced February 2022.
-
Predicting the utility of search spaces for black-box optimization: a simple, budget-aware approach
Authors:
Setareh Ariafar,
Justin Gilmer,
Zachary Nado,
Jasper Snoek,
Rodolphe Jenatton,
George E. Dahl
Abstract:
Black box optimization requires specifying a search space to explore for solutions, e.g. a d-dimensional compact space, and this choice is critical for getting the best results at a reasonable budget. Unfortunately, determining a high quality search space can be challenging in many applications. For example, when tuning hyperparameters for machine learning pipelines on a new problem given a limited budget, one must strike a balance between excluding potentially promising regions and keeping the search space small enough to be tractable. The goal of this work is to motivate -- through example applications in tuning deep neural networks -- the problem of predicting the quality of search spaces conditioned on budgets, as well as to provide a simple scoring method based on a utility function applied to a probabilistic response surface model, similar to Bayesian optimization. We show that the method we present can compute meaningful budget-conditional scores in a variety of situations. We also provide experimental evidence that accurate scores can be useful in constructing and pruning search spaces. Ultimately, we believe scoring search spaces should become standard practice in the experimental workflow for deep learning.
Submitted 16 December, 2021; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Sparse MoEs meet Efficient Ensembles
Authors:
James Urquhart Allingham,
Florian Wenzel,
Zelda E Mariet,
Basil Mustafa,
Joan Puigcerver,
Neil Houlsby,
Ghassen Jerfel,
Vincent Fortuin,
Balaji Lakshminarayanan,
Jasper Snoek,
Dustin Tran,
Carlos Riquelme Ruiz,
Rodolphe Jenatton
Abstract:
Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, often exhibit strong performance compared to individual models. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that the two approaches have complementary features whose combination is beneficial. This includes a comprehensive evaluation of sparse MoEs in uncertainty related benchmarks. Then, we present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble. Extensive experiments demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty improvements of E$^3$ over several challenging vision Transformer-based baselines. E$^3$ not only preserves its efficiency while scaling to models with up to 2.7B parameters, but also provides better predictive performance and uncertainty estimates for larger models.
Submitted 9 July, 2023; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Deep Classifiers with Label Noise Modeling and Distance Awareness
Authors:
Vincent Fortuin,
Mark Collier,
Florian Wenzel,
James Allingham,
Jeremiah Liu,
Dustin Tran,
Balaji Lakshminarayanan,
Jesse Berent,
Rodolphe Jenatton,
Effrosyni Kokiopoulou
Abstract:
Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, ImageNet-C, and ImageNet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which additionally models uncertainty over the network parameters and outperforms other ensemble baselines.
Submitted 8 August, 2022; v1 submitted 6 October, 2021;
originally announced October 2021.
-
Scaling Vision with Sparse Mixture of Experts
Authors:
Carlos Riquelme,
Joan Puigcerver,
Basil Mustafa,
Maxim Neumann,
Rodolphe Jenatton,
André Susano Pinto,
Daniel Keysers,
Neil Houlsby
Abstract:
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
Submitted 10 June, 2021;
originally announced June 2021.
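To make the idea of sparse routing in the abstract above more concrete, here is a minimal sketch of generic top-k expert routing for a single token. It is a toy under stated assumptions: the experts are hypothetical scalar functions, the router logits are made up, and the batch-level prioritization extension mentioned in the abstract is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-weighted experts."""
    weights = softmax(router_logits)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in ranked[:k]]


def moe_output(token, experts, router_logits, k):
    """Weighted sum of the selected experts; the other experts are never run."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))


# Hypothetical experts acting on a scalar "token" (real experts are MLPs).
experts = [
    lambda t: t + 1.0,
    lambda t: 2.0 * t,
    lambda t: -t,
    lambda t: t * t,
]
router_logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

routes = top_k_route(router_logits, k=2)
output = moe_output(3.0, experts, router_logits, k=2)
print(routes, output)
```

With k much smaller than the number of experts, each token pays the compute cost of only k experts even though the layer holds the parameters of all of them.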
-
Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning
Authors:
Zachary Nado,
Neil Band,
Mark Collier,
Josip Djolonga,
Michael W. Dusenberry,
Sebastian Farquhar,
Qixuan Feng,
Angelos Filos,
Marton Havasi,
Rodolphe Jenatton,
Ghassen Jerfel,
Jeremiah Liu,
Zelda Mariet,
Jeremy Nixon,
Shreyas Padhy,
Jie Ren,
Tim G. J. Rudner,
Faris Sbahi,
Yeming Wen,
Florian Wenzel,
Kevin Murphy,
D. Sculley,
Balaji Lakshminarayanan,
Jasper Snoek,
Yarin Gal
, et al. (1 additional authors not shown)
Abstract:
High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code available at https://github.com/google/uncertainty-baselines.
Submitted 5 January, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Correlated Input-Dependent Label Noise in Large-Scale Image Classification
Authors:
Mark Collier,
Basil Mustafa,
Efi Kokiopoulou,
Rodolphe Jenatton,
Jesse Berent
Abstract:
Large scale image classification datasets often contain noisy labels. We take a principled probabilistic approach to modelling input-dependent, also known as heteroscedastic, label noise in these datasets. We place a multivariate Normal distributed latent variable on the final hidden layer of a neural network classifier. The covariance matrix of this latent variable, models the aleatoric uncertainty due to label noise. We demonstrate that the learned covariance structure captures known sources of label noise between semantically similar and co-occurring classes. Compared to standard neural network training and other baselines, we show significantly improved accuracy on Imagenet ILSVRC 2012 79.3% (+2.6%), Imagenet-21k 47.0% (+1.1%) and JFT 64.7% (+1.6%). We set a new state-of-the-art result on WebVision 1.0 with 76.6% top-1 accuracy. These datasets range from over 1M to over 300M training examples and from 1k classes to more than 21k classes. Our method is simple to use, and we provide an implementation that is a drop-in replacement for the final fully-connected layer in a deep classifier.
Submitted 19 May, 2021;
originally announced May 2021.
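The abstract above describes placing a multivariate Normal latent variable on the final layer. One way to picture the resulting predictive distribution is a Monte Carlo average of the softmax over sampled logits, sketched below. The low-rank-plus-diagonal covariance parameterization, the sample count, and every name and number here are assumptions for illustration, not the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mc_predictive_probs(mean_logits, low_rank_factor, diag_scale, num_samples, rng):
    """Average softmax(u) over samples u = mean + V @ eps + d * eps_diag.

    low_rank_factor is a (num_classes x rank) matrix V; the shared eps couples
    the classes, which is what makes the sampled logit noise correlated.
    """
    num_classes = len(mean_logits)
    rank = len(low_rank_factor[0])
    avg = [0.0] * num_classes
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(rank)]
        eps_diag = [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
        logits = [
            mean_logits[c]
            + sum(low_rank_factor[c][r] * eps[r] for r in range(rank))
            + diag_scale[c] * eps_diag[c]
            for c in range(num_classes)
        ]
        probs = softmax(logits)
        avg = [a + p / num_samples for a, p in zip(avg, probs)]
    return avg


# Hypothetical network outputs for one image and 3 classes.
mean_logits = [2.0, 1.5, -1.0]
low_rank_factor = [[1.0], [0.9], [0.0]]  # classes 0 and 1 share a noise source
diag_scale = [0.1, 0.1, 0.1]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
probs = mc_predictive_probs(mean_logits, low_rank_factor, diag_scale, 200, rng)
print(probs)
```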
-
Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization
Authors:
Valerio Perrone,
Huibin Shen,
Aida Zolic,
Iaroslav Shcherbatyi,
Amr Ahmed,
Tanya Bansal,
Michele Donini,
Fela Winkelmolen,
Rodolphe Jenatton,
Jean Baptiste Faddoul,
Barbara Pogorzelska,
Miroslav Miladinovic,
Krishnaram Kenthapadi,
Matthias Seeger,
Cédric Archambeau
Abstract:
Tuning complex machine learning systems is challenging. Machine learning typically requires setting hyperparameters, be it regularization, architecture, or optimization parameters, whose tuning is critical to achieve good predictive performance. To democratize access to machine learning systems, it is essential to automate the tuning. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free optimization at scale. AMT finds the best version of a trained machine learning model by repeatedly evaluating it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture, our design principles, and lessons learned. We also describe more advanced features of AMT, such as automated early stopping and warm-starting, showing in experiments their benefits to users.
Submitted 18 June, 2021; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Amazon SageMaker Autopilot: a white box AutoML solution at scale
Authors:
Piali Das,
Valerio Perrone,
Nikita Ivkin,
Tanya Bansal,
Zohar Karnin,
Huibin Shen,
Iaroslav Shcherbatyi,
Yotam Elor,
Wilton Wu,
Aida Zolic,
Thibaut Lienart,
Alex Tang,
Amr Ahmed,
Jean Baptiste Faddoul,
Rodolphe Jenatton,
Fela Winkelmolen,
Philip Gautier,
Leo Dirac,
Andre Perunicic,
Miroslav Miladinovic,
Giovanni Zappella,
Cédric Archambeau,
Matthias Seeger,
Bhaskar Dutt,
Laurence Rouesnel
Abstract:
AutoML systems provide a black-box solution to machine learning problems by selecting the right way of processing features, choosing an algorithm and tuning the hyperparameters of the entire pipeline. Although these systems perform well on many datasets, there is still a non-negligible number of datasets for which the one-shot solution produced by each particular system would provide sub-par performance. In this paper, we present Amazon SageMaker Autopilot: a fully managed system providing an automated ML solution that can be modified when needed. Given a tabular dataset and the target column name, Autopilot identifies the problem type, analyzes the data and produces a diverse set of complete ML pipelines including feature preprocessing and ML algorithms, which are tuned to generate a leaderboard of candidate models. In the scenario where the performance is not satisfactory, a data scientist is able to view and edit the proposed ML pipelines in order to infuse their expertise and business knowledge without having to revert to a fully manual solution. This paper describes the different components of Autopilot, emphasizing the infrastructure choices that allow scalability, high quality models, editable ML pipelines, consumption of artifacts of offline meta-learning, and a convenient integration with the entire SageMaker suite allowing these trained models to be used in a production setting.
Submitted 16 December, 2020; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Training independent subnetworks for robust prediction
Authors:
Marton Havasi,
Rodolphe Jenatton,
Stanislav Fort,
Jeremiah Zhe Liu,
Jasper Snoek,
Balaji Lakshminarayanan,
Andrew M. Dai,
Dustin Tran
Abstract:
Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved `for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
Submitted 4 August, 2021; v1 submitted 13 October, 2020;
originally announced October 2020.
-
Hyperparameter Ensembles for Robustness and Uncertainty Quantification
Authors:
Florian Wenzel,
Jasper Snoek,
Dustin Tran,
Rodolphe Jenatton
Abstract:
Ensembles over neural network weights trained from different random initialization, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, ResNet 20 and Wide ResNet 28-10 architectures, we improve upon both deep and batch ensembles.
Submitted 8 January, 2021; v1 submitted 24 June, 2020;
originally announced June 2020.
-
On Mixup Regularization
Authors:
Luigi Carratino,
Moustapha Cissé,
Rodolphe Jenatton,
Jean-Philippe Vert
Abstract:
Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has been empirically shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.
Submitted 17 October, 2022; v1 submitted 10 June, 2020;
originally announced June 2020.
-
A Simple Probabilistic Method for Deep Classification under Input-Dependent Label Noise
Authors:
Mark Collier,
Basil Mustafa,
Efi Kokiopoulou,
Rodolphe Jenatton,
Jesse Berent
Abstract:
Datasets with noisy labels are a common occurrence in practical applications of classification methods. We propose a simple probabilistic method for training deep classifiers under input-dependent (heteroscedastic) label noise. We assume an underlying heteroscedastic generative process for noisy labels. To make gradient based training feasible we use a temperature parameterized softmax as a smooth approximation to the assumed generative process. We illustrate that the softmax temperature controls a bias-variance trade-off for the approximation. By tuning the softmax temperature, we improve accuracy, log-likelihood and calibration on both image classification benchmarks with controlled label noise as well as Imagenet-21k which has naturally occurring label noise. For image segmentation, our method increases the mean IoU on the PASCAL VOC and Cityscapes datasets by more than 1% over the state-of-the-art model.
Submitted 12 November, 2020; v1 submitted 15 March, 2020;
originally announced March 2020.
-
The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks
Authors:
Jakub Swiatkowski,
Kevin Roth,
Bastiaan S. Veeling,
Linh Tran,
Joshua V. Dillon,
Jasper Snoek,
Stephan Mandt,
Tim Salimans,
Rodolphe Jenatton,
Sebastian Nowozin
Abstract:
Variational Bayesian Inference is a popular methodology for approximating posterior distributions over Bayesian neural network weights. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. Furthermore, we find that such factorized parameterizations improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.
Submitted 5 July, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
How Good is the Bayes Posterior in Deep Neural Networks Really?
Authors:
Florian Wenzel,
Kevin Roth,
Bastiaan S. Veeling,
Jakub Świątkowski,
Linh Tran,
Stephan Mandt,
Jasper Snoek,
Tim Salimans,
Rodolphe Jenatton,
Sebastian Nowozin
Abstract:
During the past five years the Bayesian deep learning community has developed increasingly accurate and efficient approximate inference procedures that allow for Bayesian inference in deep neural networks. However, despite this algorithmic progress and the promise of improved uncertainty quantification and sample efficiency there are---as of early 2020---no publicized deployments of Bayesian neural networks in industrial practice. In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions compared to simpler methods including point estimates obtained from SGD. Furthermore, we demonstrate that predictive performance is improved significantly through the use of a "cold posterior" that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments. Our work questions the goal of accurate posterior approximations in Bayesian deep learning: If the true Bayes posterior is poor, what is the use of more accurate approximations? Instead, we argue that it is timely to focus on understanding the origin of the improved performance of cold posteriors.
Submitted 2 July, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Hydra: Preserving Ensemble Diversity for Model Distillation
Authors:
Linh Tran,
Bastiaan S. Veeling,
Kevin Roth,
Jakub Swiatkowski,
Joshua V. Dillon,
Jasper Snoek,
Stephan Mandt,
Tim Salimans,
Sebastian Nowozin,
Rodolphe Jenatton
Abstract:
Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each member, is lost. Thus, the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain more faithfully the diversity of the ensemble, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behavior of the original ensemble over both in-domain and out-of-distribution tasks.
Submitted 19 March, 2021; v1 submitted 14 January, 2020;
originally announced January 2020.
-
Constrained Bayesian Optimization with Max-Value Entropy Search
Authors:
Valerio Perrone,
Iaroslav Shcherbatyi,
Rodolphe Jenatton,
Cedric Archambeau,
Matthias Seeger
Abstract:
Bayesian optimization (BO) is a model-based approach to sequentially optimize expensive black-box functions, such as the validation error of a deep neural network with respect to its hyperparameters. In many real-world scenarios, the optimization is further subject to a priori unknown constraints. For example, training a deep network configuration may fail with an out-of-memory error when the model is too large. In this work, we focus on a general formulation of Gaussian process-based BO with continuous or binary constraints. We propose constrained Max-value Entropy Search (cMES), a novel information theoretic-based acquisition function implementing this formulation. We also revisit the validity of the factorized approximation adopted for rapid computation of the MES acquisition function, showing empirically that this leads to inaccurate results. On an extensive set of real-world constrained hyperparameter optimization problems we show that cMES compares favourably to prior work, while being simpler to implement and faster than other constrained extensions of Entropy Search.
Submitted 15 October, 2019;
originally announced October 2019.
-
Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning
Authors:
Valerio Perrone,
Huibin Shen,
Matthias Seeger,
Cedric Archambeau,
Rodolphe Jenatton
Abstract:
Bayesian optimization (BO) is a successful methodology to optimize black-box functions that are expensive to evaluate. While traditional methods optimize each black-box function in isolation, there has been recent interest in speeding up BO by transferring knowledge across multiple related black-box functions. In this work, we introduce a method to automatically design the BO search space by relying on evaluations of previous black-box functions. We depart from the common practice of defining a set of arbitrary search ranges a priori by considering search space geometries that are learned from historical data. This simple, yet effective strategy can be used to endow many existing BO methods with transfer learning properties. Despite its simplicity, we show that our approach considerably boosts BO by reducing the size of the search space, thus accelerating the optimization of a variety of black-box optimization problems. In particular, the proposed approach combined with random search results in a parameter-free, easy-to-implement, robust hyperparameter optimization strategy. We hope it will constitute a natural baseline for further research attempting to warm-start BO.
Submitted 27 September, 2019;
originally announced September 2019.
-
Online optimization and regret guarantees for non-additive long-term constraints
Authors:
Rodolphe Jenatton,
Jim Huang,
Dominik Csiba,
Cedric Archambeau
Abstract:
We consider online optimization in the 1-lookahead setting, where the objective does not decompose additively over the rounds of the online game. The resulting formulation enables us to deal with non-stationary and/or long-term constraints, which arise, for example, in online display advertising problems. We propose an on-line primal-dual algorithm for which we obtain dynamic cumulative regret guarantees. They depend on the convexity and the smoothness of the non-additive penalty, as well as terms capturing the smoothness with which the residuals of the non-stationary and long-term constraints vary over the rounds. We conduct experiments on synthetic data to illustrate the benefits of the non-additive penalty and show vanishing regret convergence on live traffic data collected by a display advertising platform in production.
Submitted 8 June, 2016; v1 submitted 17 February, 2016;
originally announced February 2016.
-
Adaptive Algorithms for Online Convex Optimization with Long-term Constraints
Authors:
Rodolphe Jenatton,
Jim Huang,
Cédric Archambeau
Abstract:
We present an adaptive online gradient descent algorithm to solve online convex optimization problems with long-term constraints, which are constraints that need to be satisfied when accumulated over a finite number of rounds $T$, but can be violated in intermediate rounds. For some user-defined trade-off parameter $\beta \in (0, 1)$, the proposed algorithm achieves cumulative regret bounds of $O(T^{\max\{\beta, 1-\beta\}})$ and $O(T^{1-\beta/2})$ for the loss and the constraint violations respectively. Our results hold for convex losses and can handle arbitrary convex constraints without requiring knowledge of the number of rounds in advance. Our contributions improve over the best known cumulative regret bounds by Mahdavi, et al. (2012) that are respectively $O(T^{1/2})$ and $O(T^{3/4})$ for general convex domains, and respectively $O(T^{2/3})$ and $O(T^{2/3})$ when further restricting to polyhedral domains. We supplement the analysis with experiments validating the performance of our algorithm in practice.
Submitted 23 December, 2015;
originally announced December 2015.
-
Sparse and spurious: dictionary learning with noise and outliers
Authors:
Rémi Gribonval,
Rodolphe Jenatton,
Francis Bach
Abstract:
A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical successes in various fields ranging from image to audio processing, there have only been a few theoretical arguments supporting these evidences. In particular, sparse coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, with high probability, sparse coding admits a local minimum around the reference dictionary generating the signals. Our study takes into account the case of over-complete dictionaries, noisy signals, and possible outliers, thus extending previous work limited to noiseless settings and/or under-complete dictionaries. The analysis we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the number of atoms, the sparsity and the number of observations.
Submitted 22 August, 2015; v1 submitted 19 July, 2014;
originally announced July 2014.
-
Sample Complexity of Dictionary Learning and other Matrix Factorizations
Authors:
Rémi Gribonval,
Rodolphe Jenatton,
Francis Bach,
Martin Kleinsteuber,
Matthias Seibert
Abstract:
Many modern tools in machine learning and signal processing, such as sparse dictionary learning, principal component analysis (PCA), non-negative matrix factorization (NMF), $K$-means clustering, etc., rely on the factorization of a matrix obtained by concatenating high-dimensional vectors from a training collection. While the idealized task would be to optimize the expected quality of the factors over the underlying distribution of training vectors, it is achieved in practice by minimizing an empirical average over the considered collection. The focus of this paper is to provide sample complexity estimates to uniformly control how much the empirical average deviates from the expected cost function. Standard arguments imply that the performance of the empirical predictor also exhibits such guarantees. The level of genericity of the approach encompasses several possible constraints on the factors (tensor product structure, shift-invariance, sparsity, ...), thus providing a unified perspective on the sample complexity of several widely used matrix factorization schemes. The derived generalization bounds behave proportional to $\sqrt{\log(n)/n}$ w.r.t. the number of samples $n$ for the considered matrix factorization techniques.
Submitted 9 April, 2015; v1 submitted 13 December, 2013;
originally announced December 2013.
-
Local stability and robustness of sparse dictionary learning in the presence of noise
Authors:
Rodolphe Jenatton,
Rémi Gribonval,
Francis Bach
Abstract:
A popular approach within the signal processing and machine learning communities consists in modelling signals as sparse linear combinations of atoms selected from a learned dictionary. While this paradigm has led to numerous empirical successes in various fields ranging from image to audio processing, there have only been a few theoretical arguments supporting these evidences. In particular, sparse coding, or sparse dictionary learning, relies on a non-convex procedure whose local minima have not been fully analyzed yet. In this paper, we consider a probabilistic model of sparse signals, and show that, with high probability, sparse coding admits a local minimum around the reference dictionary generating the signals. Our study takes into account the case of over-complete dictionaries and noisy signals, thus extending previous work limited to noiseless settings and/or under-complete dictionaries. The analysis we conduct is non-asymptotic and makes it possible to understand how the key quantities of the problem, such as the coherence or the level of noise, can scale with respect to the dimension of the signals, the number of atoms, the sparsity and the number of observations.
Submitted 2 October, 2012;
originally announced October 2012.
-
Learning Hierarchical and Topographic Dictionaries with Structured Sparsity
Authors:
Julien Mairal,
Rodolphe Jenatton,
Guillaume Obozinski,
Francis Bach
Abstract:
Recent work in signal processing and statistics has focused on defining new regularization functions, which not only induce sparsity of the solution, but also take into account the structure of the problem. We present in this paper a class of convex penalties introduced in the machine learning community, which take the form of a sum of l_2 and l_infinity-norms over groups of variables. They extend the classical group-sparsity regularization in the sense that the groups possibly overlap, allowing more flexibility in the group design. We review efficient optimization methods to deal with the corresponding inverse problems, and their application to the problem of learning dictionaries of natural image patches: On the one hand, dictionary learning has indeed proven effective for various signal processing tasks. On the other hand, structured sparsity provides a natural framework for modeling dependencies between dictionary elements. We thus consider a structured sparse regularization to learn dictionaries embedded in a particular structure, for instance a tree or a two-dimensional grid. In the latter case, the results we obtain are similar to the dictionaries produced by topographic independent component analysis.
Submitted 20 October, 2011;
originally announced October 2011.
-
Structured sparsity through convex optimization
Authors:
Francis Bach,
Rodolphe Jenatton,
Julien Mairal,
Guillaume Obozinski
Abstract:
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through the regularization by the $\ell_1$-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the $\ell_1$-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of non-linear variable selection.
Submitted 20 April, 2012; v1 submitted 12 September, 2011;
originally announced September 2011.
-
Optimization with Sparsity-Inducing Penalties
Authors:
Francis Bach,
Rodolphe Jenatton,
Julien Mairal,
Guillaume Obozinski
Abstract:
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. They were first dedicated to linear variable selection but numerous extensions have now emerged such as structured sparsity or kernel selection. It turns out that many of the related estimation problems can be cast as convex optimization problems by regularizing the empirical risk with appropriate non-smooth norms. The goal of this paper is to present from a general perspective optimization tools and techniques dedicated to such sparsity-inducing penalties. We cover proximal methods, block-coordinate descent, reweighted $\ell_2$-penalized techniques, working-set and homotopy methods, as well as non-convex formulations and extensions, and provide an extensive set of experiments to compare various algorithms from a computational point of view.
Submitted 22 November, 2011; v1 submitted 3 August, 2011;
originally announced August 2011.
-
Convex and Network Flow Optimization for Structured Sparsity
Authors:
Julien Mairal,
Rodolphe Jenatton,
Guillaume Obozinski,
Francis Bach
Abstract:
We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of l_2- or l_infinity-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of l_infinity-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
Submitted 16 September, 2011; v1 submitted 11 April, 2011;
originally announced April 2011.
-
Network Flow Algorithms for Structured Sparsity
Authors:
Julien Mairal,
Rodolphe Jenatton,
Guillaume Obozinski,
Francis Bach
Abstract:
We consider a class of learning problems that involve a structured sparsity-inducing norm defined as the sum of $\ell_\infty$-norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems.
Submitted 30 August, 2010;
originally announced August 2010.