Showing 1–50 of 85 results for author: Belkin, M

  1. arXiv:2403.15911  [pdf]

    physics.optics quant-ph

    Intersubband polaritonic metasurfaces for high-contrast ultra-fast power limiting and optical switching

    Authors: Michele Cotrufo, Jonas Krakofsky, Sander A. Mann, Gerhard Böhm, Mikhail A. Belkin, Andrea Alù

    Abstract: Nonlinear intersubband polaritonic metasurfaces support one of the strongest known ultrafast nonlinear responses in the mid-infrared frequency range across all condensed matter systems. Beyond harmonic generation and frequency mixing, these nonlinearities can be leveraged for ultrafast optical switching and power limiting, based on tailored transitions from strong to weak polaritonic coupling. Her…

    Submitted 23 March, 2024; originally announced March 2024.

  2. arXiv:2402.13728  [pdf, other]

    cs.LG stat.ML

    Average gradient outer product as a mechanism for deep neural collapse

    Authors: Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin

    Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to…

    Submitted 23 May, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

  3. arXiv:2402.10052  [pdf, other]

    cs.CL cs.AI

    Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination

    Authors: Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić

    Abstract: While displaying impressive generation capabilities across many tasks, Large Language Models (LLMs) still struggle with crucial issues of privacy violation and unwanted exposure of sensitive data. This raises an essential question: how should we prevent such undesired behavior of LLMs while maintaining their strong generation and natural language understanding (NLU) capabilities? In this work, we…

    Submitted 15 February, 2024; originally announced February 2024.

  4. arXiv:2401.04553  [pdf, other]

    stat.ML cs.LG

    Linear Recursive Feature Machines provably recover low-rank matrices

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

    Abstract: A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a…

    Submitted 9 January, 2024; originally announced January 2024.

  5. arXiv:2312.03311  [pdf, other]

    stat.ML cs.LG

    On the Nystrom Approximation for Preconditioning in Kernel Machines

    Authors: Amirhesam Abedsoltan, Parthe Pandit, Luis Rademacher, Mikhail Belkin

    Abstract: Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral precondi…

    Submitted 24 January, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

  6. arXiv:2311.14646  [pdf, other]

    cs.LG stat.ML

    More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

    Authors: James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin

    Abstract: In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in…

    Submitted 15 May, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: Appeared in ICLR 2024

  7. arXiv:2311.06488  [pdf]

    physics.optics physics.app-ph

    Tuning Multipolar Mie Scattering of Particles on a Dielectric-Covered Mirror

    Authors: Kan Yao, Jie Fang, Taizhi Jiang, Andrew F. Briggs, Alec M. Skipper, Youngsun Kim, Mikhail A. Belkin, Brian A. Korgel, Seth R. Bank, Yuebing Zheng

    Abstract: Optically resonant particles are key building blocks of many nanophotonic devices such as optical antennas and metasurfaces. Because the functionalities of such devices are largely determined by the optical properties of individual resonators, extending the attainable responses from a given particle is highly desirable. Practically, this is usually achieved by introducing an asymmetric dielectric…

    Submitted 11 November, 2023; originally announced November 2023.

    Comments: 16 pages, 4 figures

  8. arXiv:2309.00570  [pdf, other]

    stat.ML cs.CV cs.LG

    Mechanism of feature learning in convolutional neural networks

    Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin

    Abstract: Understanding the mechanism of how convolutional neural networks learn features from image data is a fundamental problem in machine learning and computer vision. In this work, we identify such a mechanism. We posit the Convolutional Neural Feature Ansatz, which states that covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with res…

    Submitted 1 September, 2023; originally announced September 2023.

  9. arXiv:2306.04815  [pdf, other]

    cs.LG math.OC stat.ML

    Catapults in SGD: spikes in the training loss and their impact on generalization through feature learning

    Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

    Abstract: In this paper, we first present an explanation regarding the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD). We provide evidence that the spikes in the training loss of SGD are "catapults", an optimization phenomenon originally observed in GD with large learning rates in [Lewkowycz et al. 2020]. We empirically show that thes…

    Submitted 5 June, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: ICML 2024

  10. arXiv:2306.02601  [pdf, other]

    cs.LG math.OC stat.ML

    Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

    Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

    Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method,…

    Submitted 5 June, 2023; originally announced June 2023.

  11. arXiv:2306.02533  [pdf, ps, other]

    cs.LG stat.ML

    On Emergence of Clean-Priority Learning in Early Stopped Neural Networks

    Authors: Chaoyue Liu, Amirhesam Abedsoltan, Mikhail Belkin

    Abstract: When random label noise is added to a training dataset, the prediction error of a neural network on a label-noise-free test dataset initially improves during early training but eventually deteriorates, following a U-shaped dependence on training time. This behaviour is believed to be a result of neural networks learning the pattern of clean data first and fitting the noise later in the training, a…

    Submitted 4 June, 2023; originally announced June 2023.

  12. arXiv:2302.03952  [pdf, other]

    cs.LG stat.ML

    Cut your Losses with Squentropy

    Authors: Like Hui, Mikhail Belkin, Stephen Wright

    Abstract: Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two ter…

    Submitted 8 February, 2023; originally announced February 2023.

    Comments: 18 pages, 16 figures, 6 tables

  13. arXiv:2302.02605  [pdf, other]

    cs.LG stat.ML

    Toward Large Kernel Models

    Authors: Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit

    Abstract: Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas i…

    Submitted 19 June, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: Code is available at github.com/EigenPro/EigenPro3

  14. arXiv:2212.13881  [pdf, other]

    cs.LG cs.AI stat.ML

    Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features

    Authors: Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, Mikhail Belkin

    Abstract: In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in data, for prediction remains unclear. Identifying such a mechanism is key to advancing performance and interpretability of neural networks and promoting reliable adoption of these models in scientifi…

    Submitted 9 May, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

  15. arXiv:2209.15106  [pdf, other]

    cs.LG math.OC

    Restricted Strong Convexity of Deep Learning Models with Smooth Activations

    Authors: Arindam Banerjee, Pedro Cisneros-Velarde, Libin Zhu, Mikhail Belkin

    Abstract: We consider the problem of optimization of deep learning models with smooth activation functions. While there exist influential results on the problem from the ``near initialization'' perspective, we shed considerable new light on the problem. In particular, we make two key technical contributions for such models with $L$ layers, $m$ width, and $σ_0^2$ initialization variance. First, for suitable…

    Submitted 29 September, 2022; originally announced September 2022.

  16. arXiv:2207.11621  [pdf, other]

    stat.ML cs.LG

    A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors

    Authors: Nikhil Ghosh, Mikhail Belkin

    Abstract: In this work we establish an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- have training loss close to the noise level, or are "modern" -- have a much larger number of parameters com…

    Submitted 18 April, 2023; v1 submitted 23 July, 2022; originally announced July 2022.

    Comments: Further polished writing

  17. arXiv:2207.06569  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

    Authors: Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

    Abstract: The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body…

    Submitted 20 October, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: NM and JS co-first authors

  18. arXiv:2206.15058  [pdf, other]

    cs.LG stat.ML

    A note on Linear Bottleneck networks and their Transition to Multilinearity

    Authors: Libin Zhu, Parthe Pandit, Mikhail Belkin

    Abstract: Randomly initialized wide neural networks transition to linear functions of weights as the width grows, in a ball of radius $O(1)$ around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite width assumption is violated. In this work we show tha…

    Submitted 30 June, 2022; originally announced June 2022.

  19. arXiv:2205.13525  [pdf, other]

    cs.LG

    On the Inconsistency of Kernel Ridgeless Regression in Fixed Dimensions

    Authors: Daniel Beaglehole, Mikhail Belkin, Parthe Pandit

    Abstract: ``Benign overfitting'', the ability of certain algorithms to interpolate noisy training data and yet perform well out-of-sample, has been a topic of considerable recent interest. We show, using a fixed design setup, that an important class of predictors, kernel machines with translation-invariant kernels, does not exhibit benign overfitting in fixed dimensions. In particular, the estimated predict…

    Submitted 12 April, 2023; v1 submitted 26 May, 2022; originally announced May 2022.

  20. arXiv:2205.11787  [pdf, other]

    cs.LG math.OC stat.ML

    Quadratic models for understanding catapult dynamics of neural networks

    Authors: Libin Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, Mikhail Belkin

    Abstract: While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour o…

    Submitted 1 May, 2024; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: accepted in ICLR 2024; changed the title

  21. arXiv:2205.11786  [pdf, other]

    cs.LG math.OC stat.ML

    Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

    Authors: Libin Zhu, Chaoyue Liu, Mikhail Belkin

    Abstract: In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum in-degree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and ge…

    Submitted 7 June, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2022

  22. Wide and Deep Neural Networks Achieve Optimality for Classification

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

    Abstract: While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are optimal for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural ne…

    Submitted 29 April, 2022; originally announced April 2022.

  23. arXiv:2203.05104  [pdf, other]

    cs.LG

    Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models

    Authors: Chaoyue Liu, Libin Zhu, Mikhail Belkin

    Abstract: Wide neural networks with linear output layer have been shown to be near-linear, and to have near-constant neural tangent kernel (NTK), in a region containing the optimization path of gradient descent. These findings seem counter-intuitive since in general neural networks are highly complex models. Why does a linear structure emerge when the networks become wide? In this work, we provide a new per…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: Published at ICLR 2022 (spotlight paper)

  24. arXiv:2202.08384  [pdf, other]

    cs.LG cs.CV stat.ML

    Limitations of Neural Collapse for Understanding Generalization in Deep Learning

    Authors: Like Hui, Mikhail Belkin, Preetum Nakkiran

    Abstract: The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep…

    Submitted 16 February, 2022; originally announced February 2022.

  25. arXiv:2202.06526  [pdf, other]

    cs.LG math.OC stat.ML

    Benign Overfitting in Two-layer Convolutional Neural Networks

    Authors: Yuan Cao, Zixiang Chen, Mikhail Belkin, Quanquan Gu

    Abstract: Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, there emerges a line of works studying "benign overfitting" from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there i…

    Submitted 14 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: 42 pages, 1 figure. Version 3 improves the presentation and adds a comparison with a concurrent work

  26. arXiv:2202.02458  [pdf]

    cs.NI eess.SP

    Advanced service data provisioning in ROF-based mobile backhauls/fronthauls

    Authors: Mikhail E. Belkin, Leonid Zhukov, Alexander S. Sigov

    Abstract: A new cost-efficient concept to realize a real-time monitoring of quality-of-service metrics and other service data in 5G and beyond access network using a separate return channel based on a vertical cavity surface emitting laser in the optical injection locked mode that simultaneously operates as an optical transmitter and as a resonant cavity enhanced photodetector, is proposed and discussed. Th…

    Submitted 31 January, 2022; originally announced February 2022.

    Comments: 9 pages, 4 figures, 6th International Conference on Networks and Communications (NET 2022) January 29~30, 2022, Copenhagen, Denmark

  27. arXiv:2112.14872  [pdf, other]

    math.OC cs.LG

    Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

    Abstract: Establishing a fast rate of convergence for optimization methods is crucial to their applicability in practice. With the increasing popularity of deep learning over the past decade, stochastic gradient descent and its adaptive variants (e.g. Adagrad, Adam, etc.) have become prominent methods of choice for machine learning practitioners. While a large number of works have demonstrated that these fi…

    Submitted 29 December, 2021; originally announced December 2021.

    Comments: ICML 2021 Workshop on Beyond first-order methods in ML systems

  28. arXiv:2110.07554  [pdf, other]

    cs.LG cs.AI cs.SE

    Looper: An end-to-end ML platform for product decisions

    Authors: Igor L. Markov, Hanson Wang, Nitya Kasturi, Shaun Singh, Sze Wai Yuen, Mia Garrard, Sarah Tran, Yin Huang, Zehui Wang, Igor Glotov, Tanvi Gupta, Boshuang Huang, Peng Chen, Xiaowen Xie, Michael Belkin, Sal Uryasev, Sam Howie, Eytan Bakshy, Norm Zhou

    Abstract: Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support finegrain product-metric evaluation and (iii) optimize for product goals. To address shortcomings of prior p…

    Submitted 21 June, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: 11 pages + references, 7 figures; to appear in KDD 2022

  29. Simple, Fast, and Flexible Framework for Matrix Completion with Infinite Width Neural Networks

    Authors: Adityanarayanan Radhakrishnan, George Stefanakis, Mikhail Belkin, Caroline Uhler

    Abstract: Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications, but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width n…

    Submitted 21 February, 2022; v1 submitted 30 July, 2021; originally announced August 2021.

  30. arXiv:2105.14368  [pdf, other]

    stat.ML cs.LG math.ST

    Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

    Authors: Mikhail Belkin

    Abstract: In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep…

    Submitted 29 May, 2021; originally announced May 2021.

    Comments: A version of this paper will appear in Acta Numerica

  31. arXiv:2104.13628  [pdf, other]

    cs.LG math.ST stat.ML

    Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures

    Authors: Yuan Cao, Quanquan Gu, Mikhail Belkin

    Abstract: Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mi…

    Submitted 2 January, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

    Comments: 27 pages, 3 figures. In NeurIPS 2021

  32. arXiv:2010.01092  [pdf, other]

    cs.LG stat.ML

    On the linearity of large non-linear models: when and why the tangent kernel is constant

    Authors: Chaoyue Liu, Libin Zhu, Mikhail Belkin

    Abstract: The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We…

    Submitted 19 February, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: accepted as Spotlight in NeurIPS 2020; made correction to proof

  33. arXiv:2009.08574  [pdf, other]

    cs.LG stat.ML

    Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

    Abstract: The Polyak-Lojasiewicz (PL) inequality is a sufficient condition for establishing linear convergence of gradient descent, even in non-convex settings. While several recent works use a PL-based analysis to establish linear convergence of stochastic gradient descent methods, the question remains as to whether a similar analysis can be conducted for more general optimization methods. In this work, we…

    Submitted 6 October, 2021; v1 submitted 17 September, 2020; originally announced September 2020.

  34. arXiv:2008.01036  [pdf, other]

    cs.LG math.ST stat.ML

    Multiple Descent: Design Your Own Generalization Curve

    Authors: Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi

    Abstract: This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both classical U-shaped generalization curve and the recen…

    Submitted 8 November, 2021; v1 submitted 3 August, 2020; originally announced August 2020.

    Comments: Accepted to NeurIPS 2021

  35. arXiv:2006.07322  [pdf, other]

    cs.LG stat.ML

    Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

    Authors: Like Hui, Mikhail Belkin

    Abstract: Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer v…

    Submitted 22 October, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: An extended version of the paper published at ICLR2021. Added material includes evaluations of Transformer architectures

  36. arXiv:2005.08054  [pdf, other]

    cs.LG cs.IT stat.ML

    Classification vs regression in overparameterized regimes: Does the loss function matter?

    Authors: Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, Anant Sahai

    Abstract: We compare classification and regression tasks in an overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that mini…

    Submitted 14 October, 2021; v1 submitted 16 May, 2020; originally announced May 2020.

    Journal ref: Journal of Machine Learning Research, 22(222):1-69, 2021

  37. arXiv:2003.00307  [pdf, other]

    cs.LG math.OC stat.ML

    Loss landscapes and optimization in over-parameterized non-linear systems and neural networks

    Authors: Chaoyue Liu, Libin Zhu, Mikhail Belkin

    Abstract: The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that incl…

    Submitted 26 May, 2021; v1 submitted 29 February, 2020; originally announced March 2020.

    Comments: The discussion on transition to linearity in Version 1 has been moved to arXiv:2010.01092 (appeared in NeurIPS 2020)

  38. Overparameterized Neural Networks Implement Associative Memory

    Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Caroline Uhler

    Abstract: Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience. Our main finding is that standard overparameterized deep neural networks trained using standard optimization methods implement such a mechanism for real-valued data. Empirically, we show that: (1) overparameterized autoencoders store train…

    Submitted 9 September, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

  39. arXiv:1903.07571  [pdf, other]

    cs.LG stat.ML

    Two models of double descent for weak features

    Authors: Mikhail Belkin, Daniel Hsu, Ji Xu

    Abstract: The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features $p$ is close…

    Submitted 9 October, 2020; v1 submitted 18 March, 2019; originally announced March 2019.

    Journal ref: SIAM Journal on Mathematics of Data Science, 2(4):1167-1180, 2020

  40. Reconciling modern machine learning practice and the bias-variance trade-off

    Authors: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal

    Abstract: Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in the modern machine learning practice. The bias-variance trade-off implies that a model should balance u…

    Submitted 10 September, 2019; v1 submitted 28 December, 2018; originally announced December 2018.

  41. arXiv:1811.02564  [pdf, ps, other]

    math.OC cs.LG stat.ML

    On exponential convergence of SGD in non-convex over-parametrized learning

    Authors: Raef Bassily, Mikhail Belkin, Siyuan Ma

    Abstract: Large over-parametrized models learned via stochastic gradient descent (SGD) methods have become a key element in modern machine learning. Although SGD methods are very effective in practice, most theoretical analyses of SGD suggest slower convergence than what is empirically observed. In our recent work [8] we analyzed how interpolation, common in modern over-parametrized learning, results in exp…

    Submitted 5 November, 2018; originally announced November 2018.

  42. arXiv:1811.02095  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Kernel Machines Beat Deep Neural Networks on Mask-based Single-channel Speech Enhancement

    Authors: Like Hui, Siyuan Ma, Mikhail Belkin

    Abstract: We apply a fast kernel method for mask-based single-channel speech enhancement. Specifically, our method solves a kernel regression problem associated to a non-smooth kernel function (exponential power kernel) with a highly efficient iterative method (EigenPro). Due to the simplicity of this method, its hyper-parameters such as kernel bandwidth can be automatically and efficiently selected using l…

    Submitted 5 November, 2018; originally announced November 2018.

  43. arXiv:1810.13395  [pdf, other]

    cs.LG stat.ML

    Accelerating SGD with momentum for over-parameterized learning

    Authors: Chaoyue Liu, Mikhail Belkin

    Abstract: Nesterov SGD is widely used for training modern neural networks and other machine learning models. Yet, its advantages over SGD have not been theoretically clarified. Indeed, as we show in our paper, both theoretically and empirically, Nesterov SGD with any parameter selection does not in general provide acceleration over ordinary SGD. Furthermore, Nesterov SGD may diverge for step sizes that ensu…

    Submitted 27 September, 2019; v1 submitted 31 October, 2018; originally announced October 2018.

    Comments: new version

  44. arXiv:1810.10333  [pdf, other]

    cs.CV cs.LG stat.ML

    Memorization in Overparameterized Autoencoders

    Authors: Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler

    Abstract: The ability of deep neural networks to generalize well in the overparameterized regime has become a subject of significant research interest. We show that overparameterized autoencoders exhibit memorization, a form of inductive bias that constrains the functions learned through the optimization process to concentrate around the training examples, although the network could in principle represent a…

    Submitted 3 September, 2019; v1 submitted 16 October, 2018; originally announced October 2018.

  45. Container solutions for HPC Systems: A Case Study of Using Shifter on Blue Waters

    Authors: Maxim Belkin, Roland Haas, Galen Wesley Arnold, Hon Wai Leong, Eliu A. Huerta, David Lesny, Mark Neubauer

    Abstract: Software container solutions have revolutionized application development approaches by enabling lightweight platform abstractions within the so-called "containers." Several solutions are being actively developed in attempts to bring the benefits of containers to high-performance computing systems with their stringent security demands on the one hand and fundamental resource sharing requirements on…

    Submitted 1 August, 2018; originally announced August 2018.

    Comments: 8 pages, 7 figures, in PEARC '18: Proceedings of Practice and Experience in Advanced Research Computing, July 22–26, 2018, Pittsburgh, PA, USA

  46. arXiv:1806.09471  [pdf, other]

    stat.ML cs.LG math.ST

    Does data interpolation contradict statistical optimality?

    Authors: Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov

    Abstract: We show that learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.

    Submitted 25 June, 2018; originally announced June 2018.

  47. arXiv:1806.06144  [pdf, other]

    stat.ML cs.LG

    Kernel machines that adapt to GPUs for effective large batch training

    Authors: Siyuan Ma, Mikhail Belkin

    Abstract: Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing capacity. For small batch an increase in batch size results in the proportional reduction in the training time, a phenomenon known as linear scaling. However, incr…

    Submitted 3 March, 2019; v1 submitted 15 June, 2018; originally announced June 2018.

  48. arXiv:1806.05161  [pdf, other]

    stat.ML cond-mat.stat-mech cs.LG

    Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

    Authors: Mikhail Belkin, Daniel Hsu, Partha Mitra

    Abstract: Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performanc…

    Submitted 26 October, 2018; v1 submitted 13 June, 2018; originally announced June 2018.

  49. arXiv:1804.11260  [pdf]

    physics.app-ph cond-mat.mes-hall

    Unveiling spectral purity and tunability of terahertz quantum cascade laser sources based on intra-cavity difference frequency generation

    Authors: Luigi Consolino, Seungyong Jung, Annamaria Campa, Michele De Regis, Shovon Pal, Jae Hyun Kim, Kazuue Fujita, Akio Ito, Masahiro Hitaka, Saverio Bartalini, Paolo De Natale, Mikhail A. Belkin, Miriam Serena Vitiello

    Abstract: Terahertz sources based on intra-cavity difference-frequency generation in mid-infrared quantum cascade lasers (THz DFG-QCLs) have recently emerged as the first monolithic electrically-pumped semiconductor sources capable of operating at room-temperature (RT) across the 1-6 THz range. Despite tremendous progress in power output, that now exceeds 1 mW in pulsed and 10 μW in continuous-wave regime at…

    Submitted 27 April, 2018; originally announced April 2018.

    Journal ref: Sci. Adv. 2017, 3: e1603317

  50. arXiv:1802.10235  [pdf, other]

    cs.LG math.OC

    Parametrized Accelerated Methods Free of Condition Number

    Authors: Chaoyue Liu, Mikhail Belkin

    Abstract: Analyses of accelerated (momentum-based) gradient descent usually assume bounded condition number to obtain exponential convergence rates. However, in many real problems, e.g., kernel methods or deep neural networks, the condition number, even locally, can be unbounded, unknown or mis-estimated. This poses problems in both implementing and analyzing accelerated algorithms. In this paper, we addres…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: 23 pages, 3 figures