Optimal Activation Functions for the Random Features Regression Model

Abstract: The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.

Submitted 24 March, 2023; v1 submitted 31 May, 2022; originally announced June 2022.

arXiv:2009.02604 [pdf, other]

Distributed Optimization, Averaging via ADMM, and Network Topology

Authors: Guilherme França, José Bento

Abstract: There has been an increasing necessity for scalable optimization methods, especially due to the explosion in the size of datasets and model complexity in modern machine learning applications. Scalable solvers often distribute the computation over a network of processing units. For simple algorithms such as gradient descent the dependency of the convergence time on the topology of this network is well-known. However, for more involved algorithms such as the Alternating Direction Method of Multipliers (ADMM) much less is known. At the heart of many distributed optimization algorithms there exists a gossip subroutine which averages local information over the network, and whose efficiency is crucial for the overall performance of the method. In this paper we review recent research in this area and, with the goal of isolating such a communication exchange behaviour, we compare different algorithms when applied to a canonical distributed averaging consensus problem. We also show interesting connections between ADMM and lifted Markov chains besides providing an explicit characterization of its convergence and optimal parameter tuning in terms of spectral properties of the network. Finally, we empirically study the connection between network topology and convergence rates for different algorithms on a real-world problem of sensor localization.

Submitted 5 September, 2020; originally announced September 2020.

Comments: to appear in "Proceedings of the IEEE"

arXiv:2001.11114 [pdf, other]

doi 10.1007/s10994-022-06280-y

A Family of Pairwise Multi-Marginal Optimal Transports that Define a Generalized Metric

Authors: Liang Mi, Azadeh Sheikholeslami, José Bento

Abstract: The Optimal transport (OT) problem is rapidly finding its way into machine learning. Favoring its use are its metric properties. Many problems admit solutions with guarantees only for objects embedded in metric spaces, and the use of non-metrics can complicate solving them. Multi-marginal OT (MMOT) generalizes OT to simultaneously transporting multiple distributions. It captures important relations that are missed if the transport only involves two distributions. Research on MMOT, however, has been focused on its existence, uniqueness, practical algorithms, and the choice of cost functions. There is a lack of discussion on the metric properties of MMOT, which limits its theoretical and practical use. Here, we prove new generalized metric properties for a family of pairwise MMOTs. We first explain the difficulty of proving this via two negative results. Afterward, we prove the MMOTs' metric properties. Finally, we show that the generalized triangle inequality of this family of MMOTs cannot be improved. We illustrate the superiority of our MMOTs over other generalized metrics, and over non-metrics in both synthetic and real tasks.

Submitted 22 December, 2022; v1 submitted 29 January, 2020; originally announced January 2020.

Comments: Machine Learning (2022)

arXiv:1801.04987 [pdf, other]

doi 10.1109/LSP.2018.2867800

On the Complexity of the Weighted Fused Lasso

Authors: Jose Bento, Ralph Furmaniak, Surjyendu Ray

Abstract: The solution path of the 1D fused lasso for an $n$-dimensional input is piecewise linear with $\mathcal{O}(n)$ segments (Hoefling et al. 2010 and Tibshirani et al. 2011). However, existing proofs of this bound do not hold for the weighted fused lasso. At the same time, results for the generalized lasso, of which the weighted fused lasso is a special case, allow $\Omega(3^n)$ segments (Mairal et al. 2012). In this paper, we prove that the number of segments in the solution path of the weighted fused lasso is $\mathcal{O}(n^2)$, and that, for some instances, it is $\Omega(n^2)$. We also give a new, very simple, proof of the $\mathcal{O}(n)$ bound for the fused lasso.

Submitted 19 April, 2018; v1 submitted 15 January, 2018; originally announced January 2018.

arXiv:1710.00889 [pdf, other]

How is Distributed ADMM Affected by Network Topology?

Authors: Guilherme França, José Bento

Abstract: When solving consensus optimization problems over a graph, there is often an explicit characterization of the convergence rate of Gradient Descent (GD) using the spectrum of the graph Laplacian. The same type of problems under the Alternating Direction Method of Multipliers (ADMM) are, however, poorly understood. For instance, simple but important non-strongly-convex consensus problems have not yet been analyzed, especially concerning the dependency of the convergence rate on the graph topology. Recently, for a non-strongly-convex consensus problem, a connection between distributed ADMM and lifted Markov chains was proposed, followed by a conjecture that ADMM is faster than GD by a square root factor in its convergence time, in close analogy to the mixing speedup achieved by lifting several Markov chains. Nevertheless, a proof of such a claim is still lacking. Here we provide a full characterization of the convergence of distributed over-relaxed ADMM for the same type of consensus problem in terms of the topology of the underlying graph. Our results provide explicit formulas for optimal parameter selection in terms of the second largest eigenvalue of the transition matrix of the graph's random walk. Another consequence of our results is a proof of the aforementioned conjecture, which, interestingly, we show is valid for any graph, even the ones whose random walks cannot be accelerated via Markov chain lifting.

Submitted 2 October, 2017; originally announced October 2017.

arXiv:1703.03863 [pdf, other]

Tuning Over-Relaxed ADMM

Authors: Guilherme França, José Bento

Abstract: The framework of Integral Quadratic Constraints (IQC) reduces the computation of upper bounds on the convergence rate of several optimization algorithms to a semi-definite program (SDP). In the case of over-relaxed Alternating Direction Method of Multipliers (ADMM), an explicit and closed form solution to this SDP was derived in our recent work [1]. The purpose of this paper is twofold. First, we summarize these results. Second, we explore one of its consequences which allows us to obtain general and simple formulas for optimal parameter selection. These results are valid for arbitrary strongly convex objective functions.

Submitted 5 March, 2018; v1 submitted 10 March, 2017; originally announced March 2017.

Comments: NIPS 2016, Optimizing the Optimizer Workshop

arXiv:1703.03859 [pdf, other]

doi 10.1109/LSP.2017.2654860

Markov Chain Lifting and Distributed ADMM

Authors: Guilherme França, José Bento

Abstract: The time to converge to the steady state of a finite Markov chain can be greatly reduced by a lifting operation, which creates a new Markov chain on an expanded state space. For a class of quadratic objectives, we show an analogous behavior where a distributed ADMM algorithm can be seen as a lifting of the Gradient Descent algorithm. This provides a deep insight into its faster convergence rate under optimal parameter tuning. We conjecture that this gain is always present, as opposed to the lifting of a Markov chain which sometimes only provides a marginal speedup.

Submitted 10 March, 2017; originally announced March 2017.

Comments: This work was also selected for a talk at NIPS 2016, Optimization for Machine Learning Workshop (OPT 2016)

Journal ref: IEEE Signal Processing Letters (Volume: 24, Issue: 3, March 2017)

arXiv:1702.07956 [pdf, ps, other]

Generative Adversarial Active Learning

Authors: Jia-Jie Zhu, José Bento

Abstract: We propose a new active learning by query synthesis approach using Generative Adversarial Networks (GAN). Different from regular active learning, the resulting algorithm adaptively synthesizes training instances for querying to increase learning speed. We generate queries according to the uncertainty principle, but our idea can work with other active learning principles. We report results from various numerical experiments to demonstrate the effectiveness of the proposed approach. In some settings, the proposed algorithm outperforms traditional pool-based approaches. To the best of our knowledge, this is the first active learning work using GAN.

Submitted 15 November, 2017; v1 submitted 25 February, 2017; originally announced February 2017.

arXiv:1512.02063 [pdf, other]

doi 10.1109/ISIT.2016.7541670

An Explicit Rate Bound for the Over-Relaxed ADMM

Authors: Guilherme França, José Bento

Abstract: The framework of Integral Quadratic Constraints of Lessard et al. (2014) reduces the computation of upper bounds on the convergence rate of several optimization algorithms to semi-definite programming (SDP). Followup work by Nishihara et al. (2015) applies this technique to the entire family of over-relaxed Alternating Direction Method of Multipliers (ADMM). Unfortunately, they only provide an explicit error bound for sufficiently large values of some of the parameters of the problem, leaving the computation for the general case as a numerical optimization problem. In this paper we provide an exact analytical solution to this SDP and obtain a general and explicit upper bound on the convergence rate of the entire family of over-relaxed ADMM. Furthermore, we demonstrate that it is not possible to extract from this SDP a general bound better than ours. We end with a few numerical illustrations of our result and a comparison between the convergence rate we obtain for the ADMM with known convergence rates for the Gradient Descent.

Submitted 5 March, 2018; v1 submitted 7 December, 2015; originally announced December 2015.

Comments: IEEE International Symposium on Information Theory (ISIT), 2016

arXiv:1505.02867 [pdf, other]

The Boundary Forest Algorithm for Online Supervised and Unsupervised Learning

Authors: Charles Mathy, Nate Derbinsky, José Bento, Jonathan Rosenthal, Jonathan Yedidia

Abstract: We describe a new instance-based learning algorithm called the Boundary Forest (BF) algorithm, that can be used for supervised and unsupervised learning. The algorithm builds a forest of trees whose nodes store previously seen examples. It can be shown data points one at a time and updates itself incrementally, hence it is naturally online. Few instance-based algorithms have this property while being simultaneously fast, which the BF is. This is crucial for applications where one needs to respond to input data in real time. The number of children of each node is not set beforehand but obtained from the training procedure, which makes the algorithm very flexible with regards to what data manifolds it can learn. We test its generalization performance and speed on a range of benchmark datasets and detail in which settings it outperforms the state of the art. Empirically we find that training time scales as O(DNlog(N)) and testing as O(Dlog(N)), where D is the dimensionality and N the amount of data.

Submitted 11 May, 2015; originally announced May 2015.

Comments: 7 pages, 4 figs, 1 page supp. info

Journal ref: Proc. of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2864-2870. Austin, TX, USA. (2015)

arXiv:1207.6379 [pdf, ps, other]

Identifying Users From Their Rating Patterns

Authors: José Bento, Nadia Fawaz, Andrea Montanari, Stratis Ioannidis

Abstract: This paper reports on our analysis of the 2011 CAMRa Challenge dataset (Track 2) for context-aware movie recommendation systems. The train dataset comprises 4,536,891 ratings provided by 171,670 users on 23,974 movies, as well as the household groupings of a subset of the users. The test dataset comprises 5,450 ratings for which the user label is missing, but the household label is provided. The challenge required identifying the user labels for the ratings in the test set. Our main finding is that temporal information (time labels of the ratings) is significantly more useful for achieving this objective than the user preferences (the actual ratings). Using a model that leverages this fact, we are able to identify users within a known household with an accuracy of approximately 96% (i.e. misclassification rate around 4%).

Submitted 26 July, 2012; originally announced July 2012.

Comments: Winner of the 2011 Challenge on Context-Aware Movie Recommendation (RecSys 2011 - CAMRa2011)

arXiv:1110.1769 [pdf, ps, other]

On the trade-off between complexity and correlation decay in structural learning algorithms

Authors: José Bento, Andrea Montanari

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms often fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).

Submitted 8 October, 2011; originally announced October 2011.

arXiv:1103.1689 [pdf, other]

Information Theoretic Limits on Learning Stochastic Differential Equations

Authors: José Bento, Morteza Ibrahimi, Andrea Montanari

Abstract: Consider the problem of learning the drift coefficient of a stochastic differential equation from a sample path. In this paper, we assume that the drift is parametrized by a high dimensional vector. We address the question of how long the system needs to be observed in order to learn this vector of parameters. We prove a general lower bound on this time complexity by using a characterization of mutual information as time integral of conditional variance, due to Kadota, Zakai, and Ziv. This general lower bound is applied to specific classes of linear and non-linear stochastic differential equations. In the linear case, the problem under consideration is the one of learning a matrix of interaction coefficients. We evaluate our lower bound for ensembles of sparse and dense random matrices. The resulting estimates match the qualitative behavior of upper bounds achieved by computationally efficient procedures.

Submitted 8 March, 2011; originally announced March 2011.

Comments: 6 pages, 2 figures, conference version

arXiv:0910.5761 [pdf, ps, other]

Which graphical models are difficult to learn?

Authors: Jose Bento, Andrea Montanari

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it).

Submitted 29 October, 2009; originally announced October 2009.

Showing 1–14 of 14 results for author: Bento, J