
Geodesic complexity of a cube

Abstract: The topological (resp. geodesic) complexity of a topological (resp. metric) space is roughly the smallest number of continuous rules required to choose paths (resp. shortest paths) between any points of the space. We prove that the geodesic complexity of a cube exceeds its topological complexity by exactly 2. The proof involves a careful analysis of cut loci of the cube.

Submitted 8 August, 2023; originally announced August 2023.

MSC Class: 53C22; 52B10; 55M30

arXiv:2306.02601 [pdf, other]

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2210.07774 [pdf, other]

Learning To Rank Diversely At Airbnb

Authors: Malay Haldar, Mustafa Abdool, Liwei He, Dillon Davis, Huiji Gao, Sanjeev Katariya

Abstract: Airbnb is a two-sided marketplace, bringing together hosts who own listings for rent, with prospective guests from around the globe. Applying neural network-based learning to rank techniques has led to significant improvements in matching guests with hosts. These improvements in ranking were driven by a core strategy: order the listings by their estimated booking probabilities, then iterate on techniques to make these booking probability estimates more and more accurate. Embedded implicitly in this strategy was an assumption that the booking probability of a listing could be determined independently of other listings in search results. In this paper we discuss how this assumption, pervasive throughout the commonly-used learning to rank frameworks, is false. We provide a theoretical foundation correcting this assumption, followed by efficient neural network architectures based on the theory. Explicitly accounting for possible similarities between listings, and reducing them to diversify the search results generated strong positive impact. We discuss these metric wins as part of the online A/B tests of the theory. Our method provides a practical way to diversify search results for large-scale production ranking systems.

Submitted 8 August, 2023; v1 submitted 19 September, 2022; originally announced October 2022.

Comments: Search ranking, Diversity, e-commerce

MSC Class: 68T07

ACM Class: H.3.3

arXiv:2209.02862 [pdf, other]

DAVE Aquatic Virtual Environment: Toward a General Underwater Robotics Simulator

Authors: Mabel M. Zhang, Woen-Sug Choi, Jessica Herman, Duane Davis, Carson Vogt, Michael McCarrin, Yadunund Vijay, Dharini Dutia, William Lew, Steven Peters, Brian Bingham

Abstract: We present DAVE Aquatic Virtual Environment (DAVE), an open source simulation stack for underwater robots, sensors, and environments. Conventional robotics simulators are not designed to address unique challenges that come with the marine environment, including but not limited to environment conditions that vary spatially and temporally, impaired or challenging perception, and the unavailability of data in a generally unexplored environment. Given the variety of sensors and platforms, wheels are often reinvented for specific use cases that inevitably resist wider adoption. Building on existing simulators, we provide a framework to help speed up the development and evaluation of algorithms that would otherwise require expensive and time-consuming operations at sea. The framework includes basic building blocks (e.g., new vehicles, water-tracking Doppler Velocity Logger, physics-based multibeam sonar) as well as development tools (e.g., dynamic bathymetry spawning, ocean currents), which allows the user to focus on methodology rather than software infrastructure. We demonstrate usage through example scenarios, bathymetric data import, user interfaces for data inspection and motion planning for manipulation, and visualizations.

Submitted 6 September, 2022; originally announced September 2022.

Comments: Accepted to IEEE/OES Autonomous Underwater Vehicles Symposium (AUV) 2022

arXiv:2204.10855 [pdf, other]

doi 10.1007/s40571-021-00392-3

ParticLS: Object-oriented software for discrete element methods and peridynamics

Authors: Andrew D. Davis, Brendan A. West, Nathanael J. Frisch, Devin T. O'Connor, Matthew D. Parno

Abstract: ParticLS (\emph{Partic}le \emph{L}evel \emph{S}ets) is a software library that implements the discrete element method (DEM) and meshfree methods. ParticLS tracks the interaction between individual particles whose geometries are defined by level sets capable of capturing complex shapes. These particles either represent rigid bodies or material points within a continuum. Particle-particle interactions using various contact laws numerically approximate solutions to energy and mass conservation equations, simulating rigid body dynamics or deformation/fracture. By leveraging multiple contact laws, ParticLS can simulate interacting bodies that deform, fracture, and are composed of many particles. In the continuum setting, we numerically solve the peridynamic equations -- integro-differential equations capable of modeling objects with discontinuous displacement fields and complex fracture dynamics. We show that the discretized peridynamic equations can be solved using the same software infrastructure that implements the DEM. Therefore, we design a unique software library where users can easily add particles with arbitrary geometries and new contact laws that model either rigid-body interaction or peridynamic constitutive relationships. We demonstrate ParticLS' versatility on test problems meant to showcase features applicable to a broad selection of fields such as tectonics, granular media, multiscale simulations, glacier calving, and sea ice.

Submitted 19 April, 2022; originally announced April 2022.

Journal ref: Computational Particle Mechanics (2021)

arXiv:2110.01602 [pdf, other]

Clustering a Mixture of Gaussians with Unknown Covariance

Authors: Damek Davis, Mateo Díaz, Kaizheng Wang

Abstract: We investigate a clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally-sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We prove its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we gather numerical and theoretical evidence that supports the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights. It enjoys similar optimality guarantees for mixtures of distributions that satisfy a transportation-cost inequality, encompassing Gaussian and strongly log-concave distributions.

Submitted 29 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: 89 pages

MSC Class: 62H30; 62H12; 62H05

arXiv:2108.11832 [pdf, other]

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

Authors: Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

Abstract: We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.

Submitted 9 January, 2023; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: Version 1 of the arxiv report has been split into two parts. Version 2 of the arxiv report is Part 1 of the original submission. Part 2 will appear as a separate arxiv submission

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:2106.09815 [pdf, other]

Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

Authors: Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.

Submitted 17 June, 2021; originally announced June 2021.

Comments: 29 pages, 1 figure

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:2009.14261 [pdf]

Abusive Language Detection and Characterization of Twitter Behavior

Authors: Dincy Davis, Reena Murali, Remesh Babu

Abstract: In this work, abusive language detection in online content is performed using Bidirectional Recurrent Neural Network (BiRNN) method. Here the main objective is to focus on various forms of abusive behaviors on Twitter and to detect whether a speech is abusive or not. The results are compared for various abusive behaviors in social media, with Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) methods and proved that the proposed BiRNN is a better deep learning model for automatic abusive speech detection.

Submitted 26 September, 2020; originally announced September 2020.

Comments: 7 pages, 7 figures and 8 tables

Journal ref: International Journal of Computer Sciences and Engineering, Vol.8, Issue.7, July 2020

arXiv:1912.07146 [pdf, other]

Proximal methods avoid active strict saddles of weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.

Submitted 16 February, 2021; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: 43 pages, 2 figures

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:1909.08159 [pdf, other]

Decision-Directed Data Decomposition

Authors: Brent D. Davis, Ethan Jackson, Daniel J. Lizotte

Abstract: We present an algorithm, Decision-Directed Data Decomposition (D4), which decomposes a dataset into two components. The first contains most of the useful information for a specified supervised learning task. The second orthogonal component contains little information about the task but retains associations and information that were not targeted. The algorithm is simple and scalable. We illustrate its application in image and text processing domains. Our results show that 1) post-hoc application of D4 to an image representation space can remove information about specified concepts without impacting other concepts, 2) D4 is able to improve predictive generalization in certain settings, and 3) applying D4 to word embedding representations produces state-of-the-art results in debiasing.

Submitted 10 March, 2020; v1 submitted 17 September, 2019; originally announced September 2019.

arXiv:1907.13307 [pdf, ps, other]

From low probability to high confidence in stochastic convex optimization

Authors: Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, Junyu Zhang

Abstract: Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on "light-tail" noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.

Submitted 16 October, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

Comments: 37 pages

MSC Class: 65K05; 65K10; 90C15; 90C25

arXiv:1907.09547 [pdf, other]

Stochastic algorithms with geometric step decay converge linearly on sharp functions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Vasileios Charisopoulos

Abstract: Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

Submitted 22 July, 2019; originally announced July 2019.

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:1904.10020 [pdf, other]

Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

Authors: Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, Dmitriy Drusvyatskiy

Abstract: The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.

Submitted 22 April, 2019; originally announced April 2019.

Comments: 80 pages

MSC Class: 65K10; 90C06

arXiv:1901.01624 [pdf, other]

Composite optimization for robust blind deconvolution

Authors: Vasileios Charisopoulos, Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: The blind deconvolution problem seeks to recover a pair of vectors from a set of rank one bilinear measurements. We consider a natural nonsmooth formulation of the problem and show that under standard statistical assumptions, its moduli of weak convexity, sharpness, and Lipschitz continuity are all dimension independent. This phenomenon persists even when up to half of the measurements are corrupted by noise. Consequently, standard algorithms, such as the subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. We then complete the paper with a new initialization strategy, complementing the local search algorithms. The initialization procedure is both provably efficient and robust to outlying measurements. Numerical experiments, on both simulated and real data, illustrate the developed theory and methods.

Submitted 18 January, 2019; v1 submitted 6 January, 2019; originally announced January 2019.

Comments: 60 pages, 14 figures

MSC Class: 65K10; 90C06

arXiv:1810.07590 [pdf, ps, other]

Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem.

Submitted 17 December, 2018; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: 36 pages

MSC Class: 65K10; 90C15; 68Q32

arXiv:1810.05752 [pdf, ps, other]

Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression

Authors: Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, Damek Davis

Abstract: The Expectation-Maximization algorithm is perhaps the most broadly used algorithm for inference of latent variable problems. A theoretical understanding of its performance, however, largely remains lacking. Recent results established that EM enjoys global convergence for Gaussian Mixture Models. For Mixed Linear Regression, however, only local convergence results have been established, and those only for the high SNR regime. We show here that EM converges for mixed linear regression with two components (it is known that it may fail to converge for three or more), and moreover that this convergence holds for random initialization. Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas.

Submitted 28 May, 2019; v1 submitted 12 October, 2018; originally announced October 2018.

Comments: To appear in the proceedings of the Conference on Learning Theory (COLT), 2019. This paper results from a merger of work from two groups who worked on the problem at the same time

arXiv:1807.02876 [pdf, other]

Machine Learning in High Energy Physics Community White Paper

Authors: Kim Albertsson, Piero Altoe, Dustin Anderson, John Anderson, Michael Andrews, Juan Pedro Araque Espinosa, Adam Aurisano, Laurent Basara, Adrian Bevan, Wahid Bhimji, Daniele Bonacorsi, Bjorn Burkle, Paolo Calafiura, Mario Campanelli, Louis Capps, Federico Carminati, Stefano Carrazza, Yi-fan Chen, Taylor Childers, Yann Coadou, Elias Coniavitis, Kyle Cranmer, Claire David, Douglas Davis, Andrea De Simone, et al. (103 additional authors not shown)

Abstract: Machine learning has been applied to several problems in particle physics research, beginning with applications to high-level physics analysis in the 1990s and 2000s, followed by an explosion of applications in particle and event identification and reconstruction in the 2010s. In this document we discuss promising future research and development areas for machine learning in particle physics. We detail a roadmap for their implementation, software and hardware resource requirements, collaborative initiatives with the data science community, academia and industry, and training the particle physics community in data science. The main objective of the document is to connect and motivate these areas of research and development with the physics drivers of the High-Luminosity Large Hadron Collider and future neutrino experiments and identify the resource needs for their implementation. Additionally we identify areas where collaboration with external communities will be of great benefit.

Submitted 16 May, 2019; v1 submitted 8 July, 2018; originally announced July 2018.

Comments: Editors: Sergei Gleyzer, Paul Seyfert and Steven Schramm

arXiv:1807.00255 [pdf, ps, other]

Stochastic model-based minimization under high-order growth

Authors: Damek Davis, Dmitriy Drusvyatskiy, Kellie J. MacPhee

Abstract: Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively sample and minimize stochastic convex models of the objective function. Assuming that the one-sided approximation quality and the variation of the models is controlled by a Bregman divergence, we show that the scheme drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. Under additional convexity and relative strong convexity assumptions, the function values converge to the minimum at the rate of $O(k^{-1/2})$ and $\widetilde{O}(k^{-1})$, respectively. We discuss consequences for stochastic proximal point, mirror descent, regularized Gauss-Newton, and saddle point algorithms.

Submitted 30 June, 2018; originally announced July 2018.

Comments: 30 pages

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1804.07795 [pdf, other]

Stochastic subgradient method converges on tame functions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.

Submitted 25 May, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: 32 pages, 1 figure

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1803.06523 [pdf, other]

Stochastic model-based minimization of weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle, underlying the complexity guarantees, is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.

Submitted 26 August, 2018; v1 submitted 17 March, 2018; originally announced March 2018.

Comments: 33 pages, 4 figures

MSC Class: 65K05; 65K10; 90C15; 90C30
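For readers unfamiliar with the Moreau envelope named in the abstract above, the standard definition can be written as follows. This is a textbook sketch, not notation taken from the paper: the symbols $\varphi$, $\lambda$, and $\rho$ are assumptions chosen here for illustration.

```latex
% Moreau envelope and proximal map of a function \varphi, with parameter \lambda > 0
\varphi_\lambda(x) = \min_{y} \Big\{ \varphi(y) + \tfrac{1}{2\lambda}\,\|y - x\|^2 \Big\},
\qquad
\operatorname{prox}_{\lambda\varphi}(x) = \operatorname*{argmin}_{y} \Big\{ \varphi(y) + \tfrac{1}{2\lambda}\,\|y - x\|^2 \Big\}.

% If \varphi is \rho-weakly convex and 0 < \lambda < 1/\rho, the envelope is differentiable, with
\nabla \varphi_\lambda(x) = \lambda^{-1}\big(x - \operatorname{prox}_{\lambda\varphi}(x)\big).
```

A small value of $\|\nabla \varphi_\lambda(x)\|$ indicates that $x$ is close to a point that is nearly stationary for $\varphi$, which is why this quantity can serve as a stationarity measure for nonsmooth problems.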

arXiv:1802.02988 [pdf, ps, other]

Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.

Submitted 19 February, 2018; v1 submitted 8 February, 2018; originally announced February 2018.

Comments: 12 pages

MSC Class: 65K05; 65K10; 90C15; 90C30
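The kind of iteration the abstract above refers to can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy objective $f(x) = |x^2 - 1|$ (weakly convex, as a convex function composed with a smooth map), the constraint set $[-3, 3]$, the Gaussian noise model, and the step sizes $\alpha_k = \alpha_0/\sqrt{k+1}$. The function name `prox_stochastic_subgradient` is hypothetical.

```python
import math
import random


def prox_stochastic_subgradient(x0, num_iters=2000, alpha0=0.1, noise_std=0.1, seed=0):
    """Projected stochastic subgradient method on f(x) = |x^2 - 1| over [-3, 3].

    The objective, noise model, and step sizes are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0
    for k in range(num_iters):
        # A subgradient of |x^2 - 1| at x, perturbed by zero-mean noise.
        residual = x * x - 1.0
        sign = (residual > 0) - (residual < 0)
        g = sign * 2.0 * x + rng.gauss(0.0, noise_std)
        # Diminishing step size alpha_k = alpha0 / sqrt(k + 1).
        alpha = alpha0 / math.sqrt(k + 1)
        # The proximal map of the indicator of [-3, 3] is a projection (clip).
        x = min(3.0, max(-3.0, x - alpha * g))
    return x


x_final = prox_stochastic_subgradient(x0=2.0)
print(abs(x_final * x_final - 1.0))  # objective value at the last iterate
```

Started from `x0 = 2.0`, the iterates settle near the minimizer $x = 1$. The sketch returns the last iterate for simplicity; it makes no claim about which iterate the paper's rate is stated for.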

arXiv:1801.06519 [pdf, other]

Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights

Authors: Arun Mallya, Dillon Davis, Svetlana Lazebnik

Abstract: This work presents a method for adapting a single, fixed deep neural network to multiple tasks without affecting performance on already learned tasks. By building upon ideas from network quantization and pruning, we learn binary masks that piggyback on an existing network, or are applied to unmodified weights of that network to provide good performance on a new task. These masks are learned in an end-to-end differentiable fashion, and incur a low overhead of 1 bit per network parameter, per task. Even though the underlying network is fixed, the ability to mask individual weights allows for the learning of a large number of filters. We show performance comparable to dedicated fine-tuned networks for a variety of classification tasks, including those with large domain shifts from the initial task (ImageNet), and a variety of network architectures. Unlike prior work, we do not suffer from catastrophic forgetting or competition between tasks, and our performance is agnostic to task ordering. Code available at https://github.com/arunmallya/piggyback.

Submitted 16 March, 2018; v1 submitted 19 January, 2018; originally announced January 2018.

arXiv:1707.03505 [pdf, other]

Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems

Authors: Damek Davis, Benjamin Grimmer

Abstract: In this paper, we introduce a stochastic projected subgradient method for weakly convex (i.e., uniformly prox-regular) nonsmooth, nonconvex functions---a wide class of functions which includes the additive and convex composite classes. At a high level, the method is an inexact proximal point iteration in which the strongly convex proximal subproblems are quickly solved with a specialized stochastic projected subgradient method. The primary contribution of this paper is a simple proof that the proposed algorithm converges at the same rate as the stochastic gradient method for smooth nonconvex problems. This result appears to be the first convergence rate analysis of a stochastic (or even deterministic) subgradient method for the class of weakly convex functions.

Submitted 17 September, 2018; v1 submitted 11 July, 2017; originally announced July 2017.

Comments: Updated 9/17/2018: Major Revision -added high probability bounds, improved convergence analysis in general, new experimental results. Updated 7/26/2017: Added references to introduction and a couple simple extensions as Sections 3.2 and 4. Updated 8/23/2017: Added NSF acknowledgements. Updated 10/16/2017: Added experimental results

MSC Class: 65K05; 65K10; 90C26; 90C15; 90C30

arXiv:1610.01101 [pdf, other]

A SMART Stochastic Algorithm for Nonconvex Optimization with Applications to Robust Machine Learning

Authors: Aleksandr Aravkin, Damek Davis

Abstract: In this paper, we show how to transform any optimization problem that arises from fitting a machine learning model into one that (1) detects and removes contaminated data from the training set while (2) simultaneously fitting the trimmed model on the uncontaminated data that remains. To solve the resulting nonconvex optimization problem, we introduce a fast stochastic proximal-gradient algorithm that incorporates prior knowledge through nonsmooth regularization. For datasets of size $n$, our approach requires $O(n^{2/3}/\varepsilon)$ gradient evaluations to reach $\varepsilon$-accuracy and, when a certain error bound holds, the complexity improves to $O(\kappa n^{2/3}\log(1/\varepsilon))$. These rates are $n^{1/3}$ times better than those achieved by typical, full gradient methods.

Submitted 5 February, 2017; v1 submitted 4 October, 2016; originally announced October 2016.

Comments: 33 pages, 5 figures

MSC Class: 65K10; 65K05

arXiv:1505.00870 [pdf, other]

An $O(n\log(n))$ Algorithm for Projecting Onto the Ordered Weighted $\ell_1$ Norm Ball

Authors: Damek Davis

Abstract: The ordered weighted $\ell_1$ (OWL) norm is a newly developed generalization of the Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) norm. This norm has desirable statistical properties and can be used to perform simultaneous clustering and regression. In this paper, we show how to compute the projection of an $n$-dimensional vector onto the OWL norm ball in $O(n\log(n))$ operations. In addition, we illustrate the performance of our algorithm on a synthetic regression test.

Submitted 26 June, 2015; v1 submitted 4 May, 2015; originally announced May 2015.

Comments: 1 figure, 1 table, 14 pages; example added to appendix

MSC Class: 49M99 (primary); 90C90; 90C25; 49N45 (secondary)
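To make the OWL norm in the abstract above concrete, here is a minimal Python sketch that evaluates the norm itself, using the standard definition $\Omega_w(x) = \sum_i w_i |x|_{[i]}$ with $w_1 \ge \dots \ge w_n \ge 0$ and $|x|_{[i]}$ the $i$-th largest magnitude. It is not the projection algorithm from the paper. The function names `owl_norm` and `oscar_weights` and the parameters `lam1`, `lam2` are hypothetical, chosen here for illustration.

```python
def owl_norm(x, w):
    """Ordered weighted l1 norm: sum_i w[i] * (i-th largest |x_j|).

    Assumes w is nonnegative and nonincreasing, with len(w) == len(x).
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(w[i] < w[i + 1] for i in range(len(w) - 1)) or any(wi < 0 for wi in w):
        raise ValueError("w must be nonnegative and nonincreasing")
    # Sort magnitudes in decreasing order, then pair them with the weights.
    magnitudes = sorted((abs(xi) for xi in x), reverse=True)
    return sum(wi * mi for wi, mi in zip(w, magnitudes))


def oscar_weights(n, lam1, lam2):
    """OSCAR as a special case of OWL: w_i = lam1 + lam2 * (n - i), i = 1..n."""
    return [lam1 + lam2 * (n - i) for i in range(1, n + 1)]


print(owl_norm([3.0, -1.0, 2.0], [3.0, 2.0, 1.0]))  # 3*3 + 2*2 + 1*1 = 14.0
```

With `oscar_weights(3, 1.0, 1.0)` the weights are `[3.0, 2.0, 1.0]`, so the same vector gives 14.0. This matches the OSCAR penalty computed directly: $\|x\|_1 = 6$ plus the pairwise maxima $3 + 3 + 2 = 8$.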

arXiv:1311.6048 [pdf, other]

On the Design and Analysis of Multiple View Descriptors

Authors: Jingming Dong, Jonathan Balzer, Damek Davis, Joshua Hernandez, Stefano Soatto

Abstract: We propose an extension of popular descriptors based on gradient orientation histograms (HOG, computed in a single image) to multiple views. It hinges on interpreting HOG as a conditional density in the space of sampled images, where the effects of nuisance factors such as viewpoint and illumination are marginalized. However, such marginalization is performed with respect to a very coarse approximation of the underlying distribution. Our extension leverages the fact that multiple views of the same scene allow separating intrinsic from nuisance variability, and thus afford better marginalization of the latter. The result is a descriptor that has the same complexity as single-view HOG, and can be compared in the same manner, but exploits multiple views to better trade off insensitivity to nuisance variability with specificity to intrinsic variability. We also introduce a novel multi-view wide-baseline matching dataset, consisting of a mixture of real and synthetic objects with ground-truthed camera motion and dense three-dimensional geometry.

Submitted 23 November, 2013; originally announced November 2013.

Report number: UCLA CSD TR130024, Nov. 8, 2013
