
Showing 1–21 of 21 results for author: Mobahi, H

  1. arXiv:2401.10809

    cs.LG

    Neglected Hessian component explains mysteries in Sharpness regularization

    Authors: Yann N. Dauphin, Atish Agarwala, Hossein Mobahi

    Abstract: Recent work has shown that methods like SAM which either explicitly or implicitly penalize second order information can improve generalization in deep learning. Seemingly similar methods like weight noise and gradient penalties often fail to provide such benefits. We show that these differences can be explained by the structure of the Hessian of the loss. First, we show that a common decomposition…

    Submitted 24 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

  2. arXiv:2310.16228

    cs.LG cs.CV

    On the Foundations of Shortcut Learning

    Authors: Katherine L. Hermann, Hossein Mobahi, Thomas Fel, Michael C. Mozer

    Abstract: Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity (how reliably a feature indicates train-set labels) but also on availability (how easily the feature can be extracted, or leveraged, from inputs). The literature on shortcut learning has noted examples in which models privilege one feature over another, for example tex…

    Submitted 24 October, 2023; originally announced October 2023.

  3. arXiv:2305.16292

    cs.LG

    Sharpness-Aware Minimization Leads to Low-Rank Features

    Authors: Maksym Andriushchenko, Dara Bahri, Hossein Mobahi, Nicolas Flammarion

    Abstract: Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the sharpness of the training loss of a neural network. While its generalization improvement is well-known and is the primary motivation, we uncover an additional intriguing effect of SAM: reduction of the feature rank which happens at different layers of a neural network. We show that this low-rank effect occurs very…

    Submitted 28 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: The camera-ready version (NeurIPS 2023)

  4. arXiv:2301.12923

    cs.LG cs.AI stat.ML

    On student-teacher deviations in distillation: does it pay to disobey?

    Authors: Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar

    Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in…

    Submitted 18 March, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

  5. arXiv:2110.08529

    cs.CL cs.LG

    Sharpness-Aware Minimization Improves Language Model Generalization

    Authors: Dara Bahri, Hossein Mobahi, Yi Tay

    Abstract: The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these…

    Submitted 15 March, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Comments: ACL 2022 Main Conference

  6. arXiv:2103.10427

    cs.LG cs.CV

    The Low-Rank Simplicity Bias in Deep Networks

    Authors: Minyoung Huh, Hossein Mobahi, Richard Zhang, Brian Cheung, Pulkit Agrawal, Phillip Isola

    Abstract: Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate and extend the hypothesis that deeper networks are inductively biased to find solutio…

    Submitted 23 March, 2023; v1 submitted 18 March, 2021; originally announced March 2021.

  7. arXiv:2012.07976

    cs.LG stat.ML

    NeurIPS 2020 Competition: Predicting Generalization in Deep Learning

    Authors: Yiding Jiang, Pierre Foret, Scott Yak, Daniel M. Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, Behnam Neyshabur

    Abstract: Understanding generalization in deep learning is arguably one of the most important questions in deep learning. Deep learning has been successfully applied to a large number of problems ranging from pattern recognition to complex decision making, but many researchers have recently raised concerns about deep learning, among which the most important is generalization. Despite numerous attempts, c…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: 20 pages, 2 figures. Accepted for NeurIPS 2020 Competitions Track. Lead organizer: Yiding Jiang

  8. arXiv:2011.03010

    cs.LG

    Data Augmentation via Structured Adversarial Perturbations

    Authors: Calvin Luo, Hossein Mobahi, Samy Bengio

    Abstract: Data augmentation is a major component of many machine learning methods with state-of-the-art performance. Common augmentation strategies work by drawing random samples from a space of transformations. Unfortunately, such sampling approaches are limited in expressivity, as they are unable to scale to rich transformations that depend on numerous parameters due to the curse of dimensionality. Advers…

    Submitted 5 November, 2020; originally announced November 2020.

  9. arXiv:2010.02501

    cs.LG math.OC stat.ML

    A Unifying View on Implicit Bias in Training Linear Neural Networks

    Authors: Chulhee Yun, Shankar Krishnan, Hossein Mobahi

    Abstract: We study the implicit bias of gradient flow (i.e., gradient descent with infinitesimal step size) on linear neural network training. We propose a tensor formulation of neural networks that includes fully-connected, diagonal, and convolutional networks as special cases, and investigate the linear version of the formulation called linear tensor networks. With this formulation, we can characterize th…

    Submitted 10 September, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: 38 pages, 7 figures. Revision after ICLR 2021 camera-ready version. Figure 2 newly added, theorem statements revised, including correction of Theorem 2

  10. arXiv:2010.01412

    cs.LG stat.ML

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    Authors: Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur

    Abstract: In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultan…

    Submitted 29 April, 2021; v1 submitted 3 October, 2020; originally announced October 2020.

  11. arXiv:2002.05715

    cs.LG stat.ML

    Self-Distillation Amplifies Regularization in Hilbert Space

    Authors: Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett

    Abstract: Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the se…

    Submitted 26 October, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

  12. arXiv:1912.02178

    cs.LG stat.ML

    Fantastic Generalization Measures and Where to Find Them

    Authors: Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio

    Abstract: Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusion drawn from those experiments would remain valid in other settings. We present the first large scale study o…

    Submitted 4 December, 2019; originally announced December 2019.

  13. arXiv:1906.03808

    cs.LG stat.ML

    A Closed-Form Learned Pooling for Deep Classification Networks

    Authors: Vighnesh Birodkar, Hossein Mobahi, Dilip Krishnan, Samy Bengio

    Abstract: In modern computer vision tasks, convolutional neural networks (CNNs) are indispensable for image classification tasks due to their efficiency and effectiveness. Part of their superiority compared to other architectures comes from the fact that a single, local filter is shared across the entire image. However, there are scenarios where we may need to treat spatial locations in a non-uniform manner.…

    Submitted 10 June, 2019; originally announced June 2019.

  14. arXiv:1901.11409

    cs.CV cs.LG stat.ML

    Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

    Authors: Vighnesh Birodkar, Hossein Mobahi, Samy Bengio

    Abstract: Large datasets have been crucial to the success of deep learning models in recent years, which keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself is largely untapped. Specifically, focusing on object recognition tasks, we wonder if for commo…

    Submitted 29 January, 2019; originally announced January 2019.

  15. arXiv:1810.00113

    stat.ML cs.LG

    Predicting the Generalization Gap in Deep Networks with Margin Distributions

    Authors: Yiding Jiang, Dilip Krishnan, Hossein Mobahi, Samy Bengio

    Abstract: As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held-out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how the generalization gap should be predicted from the training data and network parameters. In this paper, we p…

    Submitted 12 June, 2019; v1 submitted 28 September, 2018; originally announced October 2018.

    Comments: Published in ICLR 2019

  16. arXiv:1803.05598

    stat.ML cs.LG

    Large Margin Deep Networks for Classification

    Authors: Gamaleldin F. Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, Samy Bengio

    Abstract: We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature rep…

    Submitted 3 December, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

  17. arXiv:1610.09322

    stat.ML cs.LG

    Homotopy Analysis for Tensor PCA

    Authors: Anima Anandkumar, Yuan Deng, Rong Ge, Hossein Mobahi

    Abstract: Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function th…

    Submitted 13 June, 2017; v1 submitted 28 October, 2016; originally announced October 2016.

    Comments: Accepted to COLT 2017

  18. arXiv:1601.05116

    cs.CV cs.LG

    A Theory of Local Matching: SIFT and Beyond

    Authors: Hossein Mobahi, Stefano Soatto

    Abstract: Why has SIFT been so successful? Why can its extension, DSP-SIFT, further improve SIFT? Is there a theory that can explain both? How can such a theory benefit real applications? Can it suggest new algorithms with reduced computational complexity or new descriptors with better accuracy for matching? We construct a general theory of local descriptors for visual matching. Our theory relies on concepts…

    Submitted 19 January, 2016; originally announced January 2016.

  19. arXiv:1601.04114

    cs.LG

    Training Recurrent Neural Networks by Diffusion

    Authors: Hossein Mobahi

    Abstract: This work presents a new algorithm for training recurrent neural networks (although the ideas are applicable to feedforward networks as well). The algorithm is derived from a theory in nonconvex optimization related to the diffusion equation. The contributions made in this work are twofold. First, we show how some seemingly disconnected mechanisms used in deep learning such as smart initialization, a…

    Submitted 4 February, 2016; v1 submitted 15 January, 2016; originally announced January 2016.

  20. arXiv:1506.05439

    cs.LG cs.CV stat.ML

    Learning with a Wasserstein Loss

    Authors: Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, Tomaso Poggio

    Abstract: Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact…

    Submitted 29 December, 2015; v1 submitted 17 June, 2015; originally announced June 2015.

    Comments: NIPS 2015; v3 updates Algorithm 1 and Equations 6, 8

  21. arXiv:1006.3679

    cs.CV cs.IT cs.LG

    Segmentation of Natural Images by Texture and Boundary Compression

    Authors: Hossein Mobahi, Shankar R. Rao, Allen Y. Yang, Shankar S. Sastry, Yi Ma

    Abstract: We present a novel algorithm for segmentation of natural images that harnesses the principle of minimum description length (MDL). Our method is based on observations that a homogeneously textured region of a natural image can be well modeled by a Gaussian distribution and the region boundary can be effectively coded by an adaptive chain code. The optimal segmentation of an image is the one that gi…

    Submitted 18 June, 2010; originally announced June 2010.