-
Can Implicit Bias Imply Adversarial Robustness?
Authors:
Hancheng Min,
René Vidal
Abstract:
The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) ReLU network trained by gradient f…
▽ More
The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) ReLU network trained by gradient flow generalizes well, but it is not robust to adversarial attacks of small radius. Moreover, this phenomenon occurs despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. In this paper, we extend recent analyses of neuron alignment to show that a shallow network with a polynomial ReLU activation (pReLU) trained by gradient flow not only generalizes well but is also robust to adversarial attacks. Our results highlight the importance of the interplay between data structure and architecture design in the implicit bias and robustness of trained networks.
△ Less
Submitted 5 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities
Authors:
Konstantinos Emmanouilidis,
René Vidal,
Nicolas Loizou
Abstract:
The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving finite-sum min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, existing convergence analyses of SEG focus on its with-replacement variants, while practical implementations of the method randomly reshuffle components and sequentially use them.…
▽ More
The Stochastic Extragradient (SEG) method is one of the most popular algorithms for solving finite-sum min-max optimization and variational inequality problems (VIPs) appearing in various machine learning tasks. However, existing convergence analyses of SEG focus on its with-replacement variants, while practical implementations of the method randomly reshuffle components and sequentially use them. Unlike the well-studied with-replacement variants, SEG with Random Reshuffling (SEG-RR) lacks established theoretical guarantees. In this work, we provide a convergence analysis of SEG-RR for three classes of VIPs: (i) strongly monotone, (ii) affine, and (iii) monotone. We derive conditions under which SEG-RR achieves a faster convergence rate than the uniform with-replacement sampling SEG. In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes, a strong requirement needed in the classical with-replacement SEG. As a byproduct of our results, we provide convergence guarantees for Shuffle Once SEG (shuffles the data only at the beginning of the algorithm) and the Incremental Extragradient (does not shuffle the data). We supplement our analysis with experiments validating empirically the superior performance of SEG-RR over the classical with-replacement sampling SEG.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
Variational Information Pursuit with Large Language and Multimodal Models for Interpretable Predictions
Authors:
Kwan Ho Ryan Chan,
Aditya Chattopadhyay,
Benjamin David Haeffele,
Rene Vidal
Abstract:
Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design by sequentially selecting a short chain of task-relevant, user-defined and interpretable queries about the data that are most informative for the task. While this allows for built-in interpretability in predictive models, applying V-IP to any task requires data samples with dense concept-labeling b…
▽ More
Variational Information Pursuit (V-IP) is a framework for making interpretable predictions by design by sequentially selecting a short chain of task-relevant, user-defined and interpretable queries about the data that are most informative for the task. While this allows for built-in interpretability in predictive models, applying V-IP to any task requires data samples with dense concept-labeling by domain experts, limiting the application of V-IP to small-scale tasks where manual data annotation is feasible. In this work, we extend the V-IP framework with Foundational Models (FMs) to address this limitation. More specifically, we use a two-step process, by first leveraging Large Language Models (LLMs) to generate a sufficiently large candidate set of task-relevant interpretable concepts, then using Large Multimodal Models to annotate each data sample by semantic similarity with each concept in the generated concept set. While other interpretable-by-design frameworks such as Concept Bottleneck Models (CBMs) require an additional step of removing repetitive and non-discriminative concepts to have good interpretability and test performance, we mathematically and empirically justify that, with a sufficiently informative and task-relevant query (concept) set, the proposed FM+V-IP method does not require any type of concept filtering. In addition, we show that FM+V-IP with LLM generated concepts can achieve better test performance than V-IP with human annotated concepts, demonstrating the effectiveness of LLMs at generating efficient query sets. Finally, when compared to other interpretable-by-design frameworks such as CBMs, FM+V-IP can achieve competitive test performance using fewer number of concepts/queries in both cases with filtered or unfiltered concept sets.
△ Less
Submitted 24 August, 2023;
originally announced August 2023.
-
Variational Information Pursuit for Interpretable Predictions
Authors:
Aditya Chattopadhyay,
Kwan Ho Ryan Chan,
Benjamin D. Haeffele,
Donald Geman,
René Vidal
Abstract:
There is a growing interest in the machine learning community in develo** predictive algorithms that are "interpretable by design". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a gr…
▽ More
There is a growing interest in the machine learning community in develo** predictive algorithms that are "interpretable by design". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a greedy procedure called Information Pursuit (IP) is used, which adaptively chooses queries in order of information gain. Generative models are employed to learn the distribution of query-answers and labels, which is in turn used to estimate the most informative query. However, learning and inference with a full generative model of the data is often intractable for complex tasks. In this work, we propose Variational Information Pursuit (V-IP), a variational characterization of IP which bypasses the need for learning generative models. V-IP is based on finding a query selection strategy and a classifier that minimizes the expected cross-entropy between true and predicted labels. We then demonstrate that the IP strategy is the optimal solution to this problem. Therefore, instead of learning generative models, we can use our optimal strategy to directly pick the most informative query given any history. We then develop a practical algorithm by defining a finite-dimensional parameterization of our strategy and classifier using deep networks and train them end-to-end using our objective. Empirically, V-IP is 10-100x faster than IP on different Vision and NLP tasks with competitive performance. Moreover, V-IP finds much shorter query chains when compared to reinforcement learning which is typically used in sequential-decision-making problems. Finally, we demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.
△ Less
Submitted 15 February, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Federated Learning for Data Streams
Authors:
Othmane Marfoq,
Giovanni Neglia,
Laetitia Kameni,
Richard Vidal
Abstract:
Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while kee** such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect dur…
▽ More
Federated learning (FL) is an effective solution to train machine learning models on the increasing amount of data generated by IoT devices and smartphones while kee** such data localized. Most previous work on federated learning assumes that clients operate on static datasets collected before training starts. This approach may be inefficient because 1) it ignores new samples clients collect during training, and 2) it may require a potentially long preparatory phase for clients to collect enough data. Moreover, learning on static datasets may be simply impossible in scenarios with small aggregate storage across devices. It is, therefore, necessary to design federated algorithms able to learn from data streams. In this work, we formulate and study the problem of \emph{federated learning for data streams}. We propose a general FL algorithm to learn from data streams through an opportune weighted empirical risk minimization. Our theoretical analysis provides insights to configure such an algorithm, and we evaluate its performance on a wide range of machine learning tasks.
△ Less
Submitted 4 January, 2023;
originally announced January 2023.
-
Personalized Federated Learning through Local Memorization
Authors:
Othmane Marfoq,
Giovanni Neglia,
Laetitia Kameni,
Richard Vidal
Abstract:
Federated learning allows clients to collaboratively learn statistical models while kee** their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a sep…
▽ More
Federated learning allows clients to collaboratively learn statistical models while kee** their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained by interpolating a collectively trained global model with a local $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach in the case of binary classification, and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.
△ Less
Submitted 17 June, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Federated Multi-Task Learning under a Mixture of Distributions
Authors:
Othmane Marfoq,
Giovanni Neglia,
Aurélien Bellet,
Laetitia Kameni,
Richard Vidal
Abstract:
The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heteroge…
▽ More
The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.
△ Less
Submitted 7 November, 2022; v1 submitted 23 August, 2021;
originally announced August 2021.
-
A Game Theoretic Analysis of Additive Adversarial Attacks and Defenses
Authors:
Ambar Pal,
René Vidal
Abstract:
Research in adversarial learning follows a cat and mouse game between attackers and defenders where attacks are proposed, they are mitigated by new defenses, and subsequently new attacks are proposed that break earlier defenses, and so on. However, it has remained unclear as to whether there are conditions under which no better attacks or defenses can be proposed. In this paper, we propose a game-…
▽ More
Research in adversarial learning follows a cat and mouse game between attackers and defenders where attacks are proposed, they are mitigated by new defenses, and subsequently new attacks are proposed that break earlier defenses, and so on. However, it has remained unclear as to whether there are conditions under which no better attacks or defenses can be proposed. In this paper, we propose a game-theoretic framework for studying attacks and defenses which exist in equilibrium. Under a locally linear decision boundary model for the underlying binary classifier, we prove that the Fast Gradient Method attack and the Randomized Smoothing defense form a Nash Equilibrium. We then show how this equilibrium defense can be approximated given finitely many samples from a data-generating distribution, and derive a generalization bound for the performance of our approximation.
△ Less
Submitted 11 November, 2020; v1 submitted 14 September, 2020;
originally announced September 2020.
-
Free-rider Attacks on Model Aggregation in Federated Learning
Authors:
Yann Fraboni,
Richard Vidal,
Marco Lorenzi
Abstract:
Free-rider attacks against federated learning consist in dissimulating participation to the federated learning process with the goal of obtaining the final aggregated model without actually contributing with any data. This kind of attacks is critical in sensitive applications of federated learning, where data is scarce and the model has high commercial value. We introduce here the first theoretica…
▽ More
Free-rider attacks against federated learning consist in dissimulating participation to the federated learning process with the goal of obtaining the final aggregated model without actually contributing with any data. This kind of attacks is critical in sensitive applications of federated learning, where data is scarce and the model has high commercial value. We introduce here the first theoretical and experimental analysis of free-rider attacks on federated learning schemes based on iterative parameters aggregation, such as FedAvg or FedProx, and provide formal guarantees for these attacks to converge to the aggregated models of the fair participants. We first show that a straightforward implementation of this attack can be simply achieved by not updating the local parameters during the iterative federated optimization. As this attack can be detected by adopting simple countermeasures at the server level, we subsequently study more complex disguising schemes based on stochastic updates of the free-rider parameters. We demonstrate the proposed strategies on a number of experimental scenarios, in both iid and non-iid settings. We conclude by providing recommendations to avoid free-rider attacks in real world applications of federated learning, especially in sensitive domains where security of data and models is critical.
△ Less
Submitted 22 February, 2021; v1 submitted 21 June, 2020;
originally announced June 2020.
-
Is an Affine Constraint Needed for Affine Subspace Clustering?
Authors:
Chong You,
Chun-Guang Li,
Daniel P. Robinson,
Rene Vidal
Abstract:
Subspace clustering methods based on expressing each data point as a linear combination of other data points have achieved great success in computer vision applications such as motion segmentation, face and digit clustering. In face clustering, the subspaces are linear and subspace clustering methods can be applied directly. In motion segmentation, the subspaces are affine and an additional affine…
▽ More
Subspace clustering methods based on expressing each data point as a linear combination of other data points have achieved great success in computer vision applications such as motion segmentation, face and digit clustering. In face clustering, the subspaces are linear and subspace clustering methods can be applied directly. In motion segmentation, the subspaces are affine and an additional affine constraint on the coefficients is often enforced. However, since affine subspaces can always be embedded into linear subspaces of one extra dimension, it is unclear if the affine constraint is really necessary. This paper shows, both theoretically and empirically, that when the dimension of the ambient space is high relative to the sum of the dimensions of the affine subspaces, the affine constraint has a negligible effect on clustering performance. Specifically, our analysis provides conditions that guarantee the correctness of affine subspace clustering methods both with and without the affine constraint, and shows that these conditions are satisfied for high-dimensional data. Underlying our analysis is the notion of affinely independent subspaces, which not only provides geometrically interpretable correctness conditions, but also clarifies the relationships between existing results for affine subspace clustering.
△ Less
Submitted 8 May, 2020;
originally announced May 2020.
-
On dissipative symplectic integration with applications to gradient-based optimization
Authors:
Guilherme França,
Michael I. Jordan,
René Vidal
Abstract:
Recently, continuous-time dynamical systems have proved useful in providing conceptual and quantitative insights into gradient-based optimization, widely used in modern machine learning and statistics. An important question that arises in this line of work is how to discretize the system in such a way that its stability and rates of convergence are preserved. In this paper we propose a geometric f…
▽ More
Recently, continuous-time dynamical systems have proved useful in providing conceptual and quantitative insights into gradient-based optimization, widely used in modern machine learning and statistics. An important question that arises in this line of work is how to discretize the system in such a way that its stability and rates of convergence are preserved. In this paper we propose a geometric framework in which such discretizations can be realized systematically, enabling the derivation of "rate-matching" algorithms without the need for a discrete convergence analysis. More specifically, we show that a generalization of symplectic integrators to nonconservative and in particular dissipative Hamiltonian systems is able to preserve rates of convergence up to a controlled error. Moreover, such methods preserve a shadow Hamiltonian despite the absence of a conservation law, extending key results of symplectic integrators to nonconservative cases. Our arguments rely on a combination of backward error analysis with fundamental results from symplectic geometry. We stress that although the original motivation for this work was the application to optimization, where dissipative systems play a natural role, they are fully general and not only provide a differential geometric framework for dissipative Hamiltonian systems but also substantially extend the theory of structure-preserving integration.
△ Less
Submitted 28 April, 2021; v1 submitted 14 April, 2020;
originally announced April 2020.
-
Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications
Authors:
Qing Qu,
Zhihui Zhu,
Xiao Li,
Manolis C. Tsakiris,
John Wright,
René Vidal
Abstract:
The problem of finding the sparsest vector (direction) in a low dimensional subspace can be considered as a homogeneous variant of the sparse recovery problem, which finds applications in robust subspace recovery, dictionary learning, sparse blind deconvolution, and many other problems in signal processing and machine learning. However, in contrast to the classical sparse recovery problem, the mos…
▽ More
The problem of finding the sparsest vector (direction) in a low dimensional subspace can be considered as a homogeneous variant of the sparse recovery problem, which finds applications in robust subspace recovery, dictionary learning, sparse blind deconvolution, and many other problems in signal processing and machine learning. However, in contrast to the classical sparse recovery problem, the most natural formulation for finding the sparsest vector in a subspace is usually nonconvex. In this paper, we overview recent advances on global nonconvex optimization theory for solving this problem, ranging from geometric analysis of its optimization landscapes, to efficient optimization algorithms for solving the associated nonconvex optimization problem, to applications in machine intelligence, representation learning, and imaging sciences. Finally, we conclude this review by pointing out several interesting open problems for future research.
△ Less
Submitted 19 January, 2020;
originally announced January 2020.
-
Basis Pursuit and Orthogonal Matching Pursuit for Subspace-preserving Recovery: Theoretical Analysis
Authors:
Daniel P. Robinson,
Rene Vidal,
Chong You
Abstract:
Given an overcomplete dictionary $A$ and a signal $b = Ac^*$ for some sparse vector $c^*$ whose nonzero entries correspond to linearly independent columns of $A$, classical sparse signal recovery theory considers the problem of whether $c^*$ can be recovered as the unique sparsest solution to $b = A c$. It is now well-understood that such recovery is possible by practical algorithms when the dicti…
▽ More
Given an overcomplete dictionary $A$ and a signal $b = Ac^*$ for some sparse vector $c^*$ whose nonzero entries correspond to linearly independent columns of $A$, classical sparse signal recovery theory considers the problem of whether $c^*$ can be recovered as the unique sparsest solution to $b = A c$. It is now well-understood that such recovery is possible by practical algorithms when the dictionary $A$ is incoherent or restricted isometric. In this paper, we consider the more general case where $b$ lies in a subspace $\mathcal{S}_0$ spanned by a subset of linearly dependent columns of $A$, and the remaining columns are outside of the subspace. In this case, the sparsest representation may not be unique, and the dictionary may not be incoherent or restricted isometric. The goal is to have the representation $c$ correctly identify the subspace, i.e. the nonzero entries of $c$ should correspond to columns of $A$ that are in the subspace $\mathcal{S}_0$. Such a representation $c$ is called subspace-preserving, a key concept that has found important applications for learning low-dimensional structures in high-dimensional data. We present various geometric conditions that guarantee subspace-preserving recovery. Among them, the major results are characterized by the covering radius and the angular distance, which capture the distribution of points in the subspace and the similarity between points in the subspace and points outside the subspace, respectively. Importantly, these conditions do not require the dictionary to be incoherent or restricted isometric. By establishing that the subspace-preserving recovery problem and the classical sparse signal recovery problem are equivalent under common assumptions on the latter, we show that several of our proposed conditions are generalizations of some well-known conditions in the sparse signal recovery literature.
△ Less
Submitted 30 December, 2019;
originally announced December 2019.
-
On the Regularization Properties of Structured Dropout
Authors:
Ambar Pal,
Connor Lane,
René Vidal,
Benjamin D. Haeffele
Abstract:
Dropout and its extensions (eg. DropBlock and DropConnect) are popular heuristics for training neural networks, which have been shown to improve generalization performance in practice. However, a theoretical understanding of their optimization and regularization properties remains elusive. Recent work shows that in the case of single hidden-layer linear networks, Dropout is a stochastic gradient d…
▽ More
Dropout and its extensions (eg. DropBlock and DropConnect) are popular heuristics for training neural networks, which have been shown to improve generalization performance in practice. However, a theoretical understanding of their optimization and regularization properties remains elusive. Recent work shows that in the case of single hidden-layer linear networks, Dropout is a stochastic gradient descent method for minimizing a regularized loss, and that the regularizer induces solutions that are low-rank and balanced. In this work we show that for single hidden-layer linear networks, DropBlock induces spectral k-support norm regularization, and promotes solutions that are low-rank and have factors with equal norm. We also show that the global minimizer for DropBlock can be computed in closed form, and that DropConnect is equivalent to Dropout. We then show that some of these results can be extended to a general class of Dropout-strategies, and, with some assumptions, to deep non-linear networks when Dropout is applied to the last layer. We verify our theoretical claims and assumptions experimentally with commonly used network architectures.
△ Less
Submitted 20 June, 2020; v1 submitted 30 October, 2019;
originally announced October 2019.
-
The fastest $\ell_{1,\infty}$ prox in the west
Authors:
Benjamín Béjar,
Ivan Dokmanić,
René Vidal
Abstract:
Proximal operators are of particular interest in optimization problems dealing with non-smooth objectives because in many practical cases they lead to optimization algorithms whose updates can be computed in closed form or very efficiently. A well-known example is the proximal operator of the vector $\ell_1$ norm, which is given by the soft-thresholding operator. In this paper we study the proxima…
▽ More
Proximal operators are of particular interest in optimization problems dealing with non-smooth objectives because in many practical cases they lead to optimization algorithms whose updates can be computed in closed form or very efficiently. A well-known example is the proximal operator of the vector $\ell_1$ norm, which is given by the soft-thresholding operator. In this paper we study the proximal operator of the mixed $\ell_{1,\infty}$ matrix norm and show that it can be computed in closed form by applying the well-known soft-thresholding operator to each column of the matrix. However, unlike the vector $\ell_1$ norm case where the threshold is constant, in the mixed $\ell_{1,\infty}$ norm case each column of the matrix might require a different threshold and all thresholds depend on the given matrix. We propose a general iterative algorithm for computing these thresholds, as well as two efficient implementations that further exploit easy to compute lower bounds for the mixed norm of the optimal solution. Experiments on large-scale synthetic and real data indicate that the proposed methods can be orders of magnitude faster than state-of-the-art methods.
△ Less
Submitted 8 October, 2019;
originally announced October 2019.
-
Gradient flows and proximal splitting methods: A unified view on accelerated and stochastic optimization
Authors:
Guilherme França,
Daniel P. Robinson,
René Vidal
Abstract:
Optimization is at the heart of machine learning, statistics and many applied scientific disciplines. It also has a long history in physics, ranging from the minimal action principle to finding ground states of disordered systems such as spin glasses. Proximal algorithms form a class of methods that are broadly applicable and are particularly well-suited to nonsmooth, constrained, large-scale, and…
▽ More
Optimization is at the heart of machine learning, statistics and many applied scientific disciplines. It also has a long history in physics, ranging from the minimal action principle to finding ground states of disordered systems such as spin glasses. Proximal algorithms form a class of methods that are broadly applicable and are particularly well-suited to nonsmooth, constrained, large-scale, and distributed optimization problems. There are essentially five proximal algorithms currently known: Forward-backward splitting, Tseng splitting, Douglas-Rachford, alternating direction method of multipliers, and the more recent Davis-Yin. These methods sit on a higher level of abstraction compared to gradient-based ones, with deep roots in nonlinear functional analysis. We show that all of these methods are actually different discretizations of a single differential equation, namely, the simple gradient flow which dates back to Cauchy (1847). An important aspect behind many of the success stories in machine learning relies on "accelerating" the convergence of first-order methods. We show that similar discretization schemes applied to Newton's equation with an additional dissipative force, which we refer to as accelerated gradient flow, allow us to obtain accelerated variants of all these proximal algorithms -- the majority of which are new although some recover known cases in the literature. Furthermore, we extend these methods to stochastic settings, allowing us to make connections with Langevin and Fokker-Planck equations. Similar ideas apply to gradient descent, heavy ball, and Nesterov's method which are simpler. Our results therefore provide a unified framework from which several important optimization methods are nothing but simulations of classical dissipative systems.
△ Less
Submitted 10 May, 2021; v1 submitted 2 August, 2019;
originally announced August 2019.
-
Conformal Symplectic and Relativistic Optimization
Authors:
Guilherme França,
Jeremias Sulam,
Daniel P. Robinson,
René Vidal
Abstract:
Arguably, the two most popular accelerated or momentum-based optimization methods in machine learning are Nesterov's accelerated gradient and Polyaks's heavy ball, both corresponding to different discretizations of a particular second order differential equation with friction. Such connections with continuous-time dynamical systems have been instrumental in demystifying acceleration phenomena in o…
▽ More
Arguably, the two most popular accelerated or momentum-based optimization methods in machine learning are Nesterov's accelerated gradient and Polyaks's heavy ball, both corresponding to different discretizations of a particular second order differential equation with friction. Such connections with continuous-time dynamical systems have been instrumental in demystifying acceleration phenomena in optimization. Here we study structure-preserving discretizations for a certain class of dissipative (conformal) Hamiltonian systems, allowing us to analyze the symplectic structure of both Nesterov and heavy ball, besides providing several new insights into these methods. Moreover, we propose a new algorithm based on a dissipative relativistic system that normalizes the momentum and may result in more stable/faster optimization. Importantly, such a method generalizes both Nesterov and heavy ball, each being recovered as distinct limiting cases, and has potential advantages at no additional cost.
△ Less
Submitted 24 December, 2020; v1 submitted 10 March, 2019;
originally announced March 2019.
-
A Nonsmooth Dynamical Systems Perspective on Accelerated Extensions of ADMM
Authors:
Guilherme França,
Daniel P. Robinson,
René Vidal
Abstract:
Recently, there has been great interest in connections between continuous-time dynamical systems and optimization methods, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method…
▽ More
Recently, there has been great interest in connections between continuous-time dynamical systems and optimization methods, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method of multipliers (ADMM). Through a Lyapunov analysis, we derive rates of convergence for these dynamical systems in different settings that illustrate an interesting tradeoff between decaying versus constant dam** strategies. We also obtain modified equations capturing fine-grained details of these methods, which have improved stability and preserve the leading order convergence rates. An extension to general nonlinear equality and inequality constraints in connection with singular perturbation theory is provided.
△ Less
Submitted 17 January, 2023; v1 submitted 12 August, 2018;
originally announced August 2018.
-
On the Implicit Bias of Dropout
Authors:
Poorya Mianjy,
Raman Arora,
Rene Vidal
Abstract:
Algorithmic approaches endow deep learning systems with implicit bias that helps them generalize even in over-parametrized settings. In this paper, we focus on understanding such a bias induced in learning through dropout, a popular technique to avoid overfitting in deep learning. For single hidden-layer linear neural networks, we show that dropout tends to make the norm of incoming/outgoing weigh…
▽ More
Algorithmic approaches endow deep learning systems with implicit bias that helps them generalize even in over-parametrized settings. In this paper, we focus on understanding such a bias induced in learning through dropout, a popular technique to avoid overfitting in deep learning. For single hidden-layer linear neural networks, we show that dropout tends to make the norm of incoming/outgoing weight vectors of all the hidden nodes equal. In addition, we provide a complete characterization of the optimization landscape induced by dropout.
△ Less
Submitted 25 June, 2018;
originally announced June 2018.
-
Theoretical Analysis of Sparse Subspace Clustering with Missing Entries
Authors:
Manolis C. Tsakiris,
Rene Vidal
Abstract:
Sparse Subspace Clustering (SSC) is a popular unsupervised machine learning method for clustering data lying close to an unknown union of low-dimensional linear subspaces; a problem with numerous applications in pattern recognition and computer vision. Even though the behavior of SSC for complete data is by now well-understood, little is known about its theoretical properties when applied to data…
▽ More
Sparse Subspace Clustering (SSC) is a popular unsupervised machine learning method for clustering data lying close to an unknown union of low-dimensional linear subspaces; a problem with numerous applications in pattern recognition and computer vision. Even though the behavior of SSC for complete data is by now well-understood, little is known about its theoretical properties when applied to data with missing entries. In this paper we give theoretical guarantees for SSC with incomplete data, and analytically establish that projecting the zero-filled data onto the observation pattern of the point being expressed leads to a substantial improvement in performance. The main insight that stems from our analysis is that even though the projection induces additional missing entries, this is counterbalanced by the fact that the projected and zero-filled data are in effect incomplete points associated with the union of the corresponding projected subspaces, with respect to which the point being expressed is complete. The significance of this phenomenon potentially extends to the entire class of self-expressive methods.
△ Less
Submitted 9 February, 2018; v1 submitted 31 December, 2017;
originally announced January 2018.
-
Dropout as a Low-Rank Regularizer for Matrix Factorization
Authors:
Jacopo Cavazza,
Pietro Morerio,
Benjamin Haeffele,
Connor Lane,
Vittorio Murino,
Rene Vidal
Abstract:
Regularization for matrix factorization (MF) and approximation problems has been carried out in many different ways. Due to its popularity in deep learning, dropout has been applied also for this class of problems. Despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive for this class of problems. In this paper, we present a theoretical…
▽ More
Regularization for matrix factorization (MF) and approximation problems has been carried out in many different ways. Due to its popularity in deep learning, dropout has been applied also for this class of problems. Despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive for this class of problems. In this paper, we present a theoretical analysis of dropout for MF, where Bernoulli random variables are used to drop columns of the factors. We demonstrate the equivalence between dropout and a fully deterministic model for MF in which the factors are regularized by the sum of the product of squared Euclidean norms of the columns. Additionally, we inspect the case of a variable sized factorization and we prove that dropout achieves the global minimum of a convex approximation problem with (squared) nuclear norm regularization. As a result, we conclude that dropout can be used as a low-rank regularizer with data dependent singular-value thresholding.
△ Less
Submitted 13 October, 2017;
originally announced October 2017.
-
An Analysis of Dropout for Matrix Factorization
Authors:
Jacopo Cavazza,
Connor Lane,
Benjamin D. Haeffele,
Vittorio Murino,
René Vidal
Abstract:
Dropout is a simple yet effective algorithm for regularizing neural networks by randomly drop** out units through Bernoulli multiplicative noise, and for some restricted problem classes, such as linear or logistic regression, several theoretical studies have demonstrated the equivalence between dropout and a fully deterministic optimization problem with data-dependent Tikhonov regularization. Th…
▽ More
Dropout is a simple yet effective algorithm for regularizing neural networks by randomly drop** out units through Bernoulli multiplicative noise, and for some restricted problem classes, such as linear or logistic regression, several theoretical studies have demonstrated the equivalence between dropout and a fully deterministic optimization problem with data-dependent Tikhonov regularization. This work presents a theoretical analysis of dropout for matrix factorization, where Bernoulli random variables are used to drop a factor, thereby attempting to control the size of the factorization. While recent work has demonstrated the empirical effectiveness of dropout for matrix factorization, a theoretical understanding of the regularization properties of dropout in this context remains elusive. This work demonstrates the equivalence between dropout and a fully deterministic model for matrix factorization in which the factors are regularized by the sum of the product of the norms of the columns. While the resulting regularizer is closely related to a variational form of the nuclear norm, suggesting that dropout may limit the size of the factorization, we show that it is possible to trivially lower the objective value by doubling the size of the factorization. We show that this problem is caused by the use of a fixed dropout rate, which motivates the use of a rate that increases with the size of the factorization. Synthetic experiments validate our theoretical findings.
△ Less
Submitted 10 October, 2017;
originally announced October 2017.
-
Hyperplane Clustering Via Dual Principal Component Pursuit
Authors:
Manolis C. Tsakiris,
Rene Vidal
Abstract:
We extend the theoretical analysis of a recently proposed single subspace learning algorithm, called Dual Principal Component Pursuit (DPCP), to the case where the data are drawn from of a union of hyperplanes. To gain insight into the properties of the $\ell_1$ non-convex problem associated with DPCP, we develop a geometric analysis of a closely related continuous optimization problem. Then trans…
▽ More
We extend the theoretical analysis of a recently proposed single subspace learning algorithm, called Dual Principal Component Pursuit (DPCP), to the case where the data are drawn from of a union of hyperplanes. To gain insight into the properties of the $\ell_1$ non-convex problem associated with DPCP, we develop a geometric analysis of a closely related continuous optimization problem. Then transferring this analysis to the discrete problem, our results state that as long as the hyperplanes are sufficiently separated, the dominant hyperplane is sufficiently dominant and the points are uniformly distributed inside the associated hyperplanes, then the non-convex DPCP problem has a unique global solution, equal to the normal vector of the dominant hyperplane. This suggests the correctness of a sequential hyperplane learning algorithm based on DPCP. A thorough experimental evaluation reveals that hyperplane learning schemes based on DPCP dramatically improve over the state-of-the-art methods for the case of synthetic data, while are competitive to the state-of-the-art in the case of 3D plane clustering for Kinect data.
△ Less
Submitted 19 June, 2017; v1 submitted 6 June, 2017;
originally announced June 2017.
-
Provable Self-Representation Based Outlier Detection in a Union of Subspaces
Authors:
Chong You,
Daniel P. Robinson,
René Vidal
Abstract:
Many computer vision tasks involve processing large amounts of data contaminated by outliers, which need to be detected and rejected. While outlier detection methods based on robust statistics have existed for decades, only recently have methods based on sparse and low-rank representation been developed along with guarantees of correct outlier detection when the inliers lie in one or more low-dime…
▽ More
Many computer vision tasks involve processing large amounts of data contaminated by outliers, which need to be detected and rejected. While outlier detection methods based on robust statistics have existed for decades, only recently have methods based on sparse and low-rank representation been developed along with guarantees of correct outlier detection when the inliers lie in one or more low-dimensional subspaces. This paper proposes a new outlier detection method that combines tools from sparse representation with random walks on a graph. By exploiting the property that data points can be expressed as sparse linear combinations of each other, we obtain an asymmetric affinity matrix among data points, which we use to construct a weighted directed graph. By defining a suitable Markov Chain from this graph, we establish a connection between inliers/outliers and essential/inessential states of the Markov chain, which allows us to detect outliers by using random walks. We provide a theoretical analysis that justifies the correctness of our method under geometric and connectivity assumptions. Experimental results on image databases demonstrate its superiority with respect to state-of-the-art sparse and low-rank outlier detection methods.
△ Less
Submitted 12 April, 2017;
originally announced April 2017.
-
Curriculum Dropout
Authors:
Pietro Morerio,
Jacopo Cavazza,
Riccardo Volpi,
Rene Vidal,
Vittorio Murino
Abstract:
Dropout is a very effective way of regularizing neural networks. Stochastically "drop** out" units with a certain probability discourages over-specific co-adaptations of feature detectors, preventing overfitting and improving network generalization. Besides, Dropout can be interpreted as an approximate model aggregation technique, where an exponential number of smaller networks are averaged in o…
▽ More
Dropout is a very effective way of regularizing neural networks. Stochastically "drop** out" units with a certain probability discourages over-specific co-adaptations of feature detectors, preventing overfitting and improving network generalization. Besides, Dropout can be interpreted as an approximate model aggregation technique, where an exponential number of smaller networks are averaged in order to get a more powerful ensemble. In this paper, we show that using a fixed dropout probability during training is a suboptimal choice. We thus propose a time scheduling for the probability of retaining neurons in the network. This induces an adaptive regularization scheme that smoothly increases the difficulty of the optimization problem. This idea of "starting easy" and adaptively increasing the difficulty of the learning problem has its roots in curriculum learning and allows one to train better models. Indeed, we prove that our optimization strategy implements a very general curriculum scheme, by gradually adding noise to both the input and intermediate feature representations within the network architecture. Experiments on seven image classification datasets and different network architectures show that our method, named Curriculum Dropout, frequently yields to better generalization and, at worst, performs just as well as the standard Dropout method.
△ Less
Submitted 3 August, 2017; v1 submitted 17 March, 2017;
originally announced March 2017.
-
Information Pursuit: A Bayesian Framework for Sequential Scene Parsing
Authors:
Ehsan Jahangiri,
Erdem Yoruk,
Rene Vidal,
Laurent Younes,
Donald Geman
Abstract:
Despite enormous progress in object detection and classification, the problem of incorporating expected contextual relationships among object instances into modern recognition systems remains a key challenge. In this work we propose Information Pursuit, a Bayesian framework for scene parsing that combines prior models for the geometry of the scene and the spatial arrangement of objects instances w…
▽ More
Despite enormous progress in object detection and classification, the problem of incorporating expected contextual relationships among object instances into modern recognition systems remains a key challenge. In this work we propose Information Pursuit, a Bayesian framework for scene parsing that combines prior models for the geometry of the scene and the spatial arrangement of objects instances with a data model for the output of high-level image classifiers trained to answer specific questions about the scene. In the proposed framework, the scene interpretation is progressively refined as evidence accumulates from the answers to a sequence of questions. At each step, we choose the question to maximize the mutual information between the new answer and the full interpretation given the current evidence obtained from previous inquiries. We also propose a method for learning the parameters of the model from synthesized, annotated scenes obtained by top-down sampling from an easy-to-learn generative scene model. Finally, we introduce a database of annotated indoor scenes of dining room tables, which we use to evaluate the proposed approach.
△ Less
Submitted 9 January, 2017;
originally announced January 2017.
-
Joint Spatial-Angular Sparse Coding for dMRI with Separable Dictionaries
Authors:
Evan Schwab,
René Vidal,
Nicolas Charon
Abstract:
Diffusion MRI (dMRI) provides the ability to reconstruct neuronal fibers in the brain, $\textit{in vivo}$, by measuring water diffusion along angular gradient directions in q-space. High angular resolution diffusion imaging (HARDI) can produce better estimates of fiber orientation than the popularly used diffusion tensor imaging, but the high number of samples needed to estimate diffusivity requir…
▽ More
Diffusion MRI (dMRI) provides the ability to reconstruct neuronal fibers in the brain, $\textit{in vivo}$, by measuring water diffusion along angular gradient directions in q-space. High angular resolution diffusion imaging (HARDI) can produce better estimates of fiber orientation than the popularly used diffusion tensor imaging, but the high number of samples needed to estimate diffusivity requires longer patient scan times. To accelerate dMRI, compressed sensing (CS) has been utilized by exploiting a sparse dictionary representation of the data, discovered through sparse coding. The sparser the representation, the fewer samples are needed to reconstruct a high resolution signal with limited information loss, and so an important area of research has focused on finding the sparsest possible representation of dMRI. Current reconstruction methods however, rely on an angular representation $\textit{per voxel}$ with added spatial regularization, and so, for non-zero signals, one is required to have at least one non-zero coefficient per voxel. This means that the global level of sparsity must be greater than the number of voxels. In contrast, we propose a joint spatial-angular representation of dMRI that will allow us to achieve levels of global sparsity that are below the number of voxels. A major challenge, however, is the computational complexity of solving a global sparse coding problem over large-scale dMRI. In this work, we present novel adaptations of popular sparse coding algorithms that become better suited for solving large-scale problems by exploiting spatial-angular separability. Our experiments show that our method achieves significantly sparser representations of HARDI than is possible by the state of the art.
△ Less
Submitted 29 May, 2018; v1 submitted 17 December, 2016;
originally announced December 2016.
-
Oracle Based Active Set Algorithm for Scalable Elastic Net Subspace Clustering
Authors:
Chong You,
Chun-Guang Li,
Daniel P. Robinson,
Rene Vidal
Abstract:
State-of-the-art subspace clustering methods are based on expressing each data point as a linear combination of other data points while regularizing the matrix of coefficients with $\ell_1$, $\ell_2$ or nuclear norms. $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad theoretical conditio…
▽ More
State-of-the-art subspace clustering methods are based on expressing each data point as a linear combination of other data points while regularizing the matrix of coefficients with $\ell_1$, $\ell_2$ or nuclear norms. $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad theoretical conditions, but the clusters may not be connected. $\ell_2$ and nuclear norm regularization often improve connectivity, but give a subspace-preserving affinity only for independent subspaces. Mixed $\ell_1$, $\ell_2$ and nuclear norm regularizations offer a balance between the subspace-preserving and connectedness properties, but this comes at the cost of increased computational complexity. This paper studies the geometry of the elastic net regularizer (a mixture of the $\ell_1$ and $\ell_2$ norms) and uses it to derive a provably correct and scalable active set method for finding the optimal coefficients. Our geometric analysis also provides a theoretical justification and a geometric interpretation for the balance between the connectedness (due to $\ell_2$ regularization) and subspace-preserving (due to $\ell_1$ regularization) properties for elastic net subspace clustering. Our experiments show that the proposed active set method not only achieves state-of-the-art clustering performance, but also efficiently handles large-scale datasets.
△ Less
Submitted 9 May, 2016;
originally announced May 2016.
-
Subspace-Sparse Representation
Authors:
C. You,
R. Vidal
Abstract:
Given an overcomplete dictionary $A$ and a signal $b$ that is a linear combination of a few linearly independent columns of $A$, classical sparse recovery theory deals with the problem of recovering the unique sparse representation $x$ such that $b = A x$. It is known that under certain conditions on $A$, $x$ can be recovered by the Basis Pursuit (BP) and the Orthogonal Matching Pursuit (OMP) algo…
▽ More
Given an overcomplete dictionary $A$ and a signal $b$ that is a linear combination of a few linearly independent columns of $A$, classical sparse recovery theory deals with the problem of recovering the unique sparse representation $x$ such that $b = A x$. It is known that under certain conditions on $A$, $x$ can be recovered by the Basis Pursuit (BP) and the Orthogonal Matching Pursuit (OMP) algorithms. In this work, we consider the more general case where $b$ lies in a low-dimensional subspace spanned by some columns of $A$, which are possibly linearly dependent. In this case, the sparsest solution $x$ is generally not unique, and we study the problem that the representation $x$ identifies the subspace, i.e. the nonzero entries of $x$ correspond to dictionary atoms that are in the subspace. Such a representation $x$ is called subspace-sparse. We present sufficient conditions for guaranteeing subspace-sparse recovery, which have clear geometric interpretations and explain properties of subspace-sparse recovery. We also show that the sufficient conditions can be satisfied under a randomized model. Our results are applicable to the traditional sparse recovery problem and we get conditions for sparse recovery that are less restrictive than the canonical mutual coherent condition. We also use the results to analyze the sparse representation based classification (SRC) method, for which we get conditions to show its correctness.
△ Less
Submitted 5 July, 2015;
originally announced July 2015.
-
Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit
Authors:
Chong You,
Daniel P. Robinson,
Rene Vidal
Abstract:
Subspace clustering methods based on $\ell_1$, $\ell_2$ or nuclear norm regularization have become very popular due to their simplicity, theoretical guarantees and empirical success. However, the choice of the regularizer can greatly impact both theory and practice. For instance, $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between po…
▽ More
Subspace clustering methods based on $\ell_1$, $\ell_2$ or nuclear norm regularization have become very popular due to their simplicity, theoretical guarantees and empirical success. However, the choice of the regularizer can greatly impact both theory and practice. For instance, $\ell_1$ regularization is guaranteed to give a subspace-preserving affinity (i.e., there are no connections between points from different subspaces) under broad conditions (e.g., arbitrary subspaces and corrupted data). However, it requires solving a large scale convex optimization problem. On the other hand, $\ell_2$ and nuclear norm regularization provide efficient closed form solutions, but require very strong assumptions to guarantee a subspace-preserving affinity, e.g., independent subspaces and uncorrupted data. In this paper we study a subspace clustering method based on orthogonal matching pursuit. We show that the method is both computationally efficient and guaranteed to give a subspace-preserving affinity under broad conditions. Experiments on synthetic data verify our theoretical analysis, and applications in handwritten digit and face clustering show that our approach achieves the best trade off between accuracy and efficiency.
△ Less
Submitted 5 May, 2016; v1 submitted 5 July, 2015;
originally announced July 2015.
-
Global Optimality in Tensor Factorization, Deep Learning, and Beyond
Authors:
Benjamin D. Haeffele,
Rene Vidal
Abstract:
Techniques involving factorization are found in a wide range of applications and have enjoyed significant empirical success in many fields. However, common to a vast majority of these problems is the significant disadvantage that the associated optimization problems are typically non-convex due to a multilinear form or other convexity destroying transformation. Here we build on ideas from convex r…
▽ More
Techniques involving factorization are found in a wide range of applications and have enjoyed significant empirical success in many fields. However, common to a vast majority of these problems is the significant disadvantage that the associated optimization problems are typically non-convex due to a multilinear form or other convexity destroying transformation. Here we build on ideas from convex relaxations of matrix factorizations and present a very general framework which allows for the analysis of a wide range of non-convex factorization problems - including matrix factorization, tensor factorization, and deep neural network training formulations. We derive sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and show that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm. Our framework also provides a partial theoretical justification for the increasingly common use of Rectified Linear Units (ReLUs) in deep neural networks and offers guidance on deep network architectures and regularization strategies to facilitate efficient optimization.
△ Less
Submitted 24 June, 2015;
originally announced June 2015.
-
Sparse Subspace Clustering: Algorithm, Theory, and Applications
Authors:
Ehsan Elhamifar,
Rene Vidal
Abstract:
In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories the data belongs to. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to clu…
▽ More
In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories the data belongs to. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of subspaces and the distribution of data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm can be solved efficiently and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal with data nuisances, such as noise, sparse outlying entries, and missing entries, directly by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
△ Less
Submitted 4 February, 2013; v1 submitted 5 March, 2012;
originally announced March 2012.