Showing 1–17 of 17 results for author: Ghorbani, B

  1. arXiv:2312.06134  [pdf, other]

    cs.CL cs.LG

    Order Matters in the Presence of Dataset Imbalance for Multilingual Learning

    Authors: Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani

    Abstract: In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's be…

    Submitted 11 December, 2023; originally announced December 2023.

  2. arXiv:2305.09860  [pdf, other]

    cs.CL cs.AI cs.LG

    Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation

    Authors: Markus Freitag, Behrooz Ghorbani, Patrick Fernandes

    Abstract: Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions. However, the performance of MBR decoding depends heavily on how and how many candidates are sampled from the model. In this paper, we explore how different sampling approaches for generatin…

    Submitted 17 May, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

  3. arXiv:2302.09650  [pdf, other]

    cs.CL cs.LG

    Scaling Laws for Multilingual Neural Machine Translation

    Authors: Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

    Abstract: In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture only affects the…

    Submitted 19 February, 2023; originally announced February 2023.

    Comments: 19 pages, 20 figures

  4. arXiv:2302.04907  [pdf, other]

    cs.CL cs.LG

    Binarized Neural Machine Translation

    Authors: Yichi Zhang, Ankush Garg, Yuan Cao, Łukasz Lew, Behrooz Ghorbani, Zhiru Zhang, Orhan Firat

    Abstract: The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residu…

    Submitted 9 February, 2023; originally announced February 2023.

    Journal ref: Published at NeurIPS 2023

  5. arXiv:2209.11379  [pdf, other]

    cs.LG cs.AI

    Do Current Multi-Task Optimization Methods in Deep Learning Even Help?

    Authors: Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer

    Abstract: Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empiric…

    Submitted 22 September, 2022; originally announced September 2022.

  6. arXiv:2207.14484  [pdf, other]

    cs.LG

    Adaptive Gradient Methods at the Edge of Stability

    Authors: Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer

    Abstract: Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical…

    Submitted 15 April, 2024; v1 submitted 29 July, 2022; originally announced July 2022.

    Comments: v2 corrects the formula for Adam's preconditioner in Eq 2

  7. arXiv:2202.01994  [pdf, other]

    cs.LG cs.CL

    Data Scaling Laws in NMT: The Effect of Noise and Architecture

    Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

    Abstract: In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand…

    Submitted 4 February, 2022; originally announced February 2022.

  8. arXiv:2202.00528  [pdf, other]

    cs.CL cs.LG

    Examining Scaling and Transfer of Language Model Architectures for Machine Translation

    Authors: Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

    Abstract: Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigati…

    Submitted 16 February, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

  9. arXiv:2110.04369  [pdf, other]

    cs.LG cs.AI

    A Loss Curvature Perspective on Training Instability in Deep Learning

    Authors: Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Dahl, Zachary Nado, Orhan Firat

    Abstract: In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 20 pages, 16 figures

  10. arXiv:2109.07740  [pdf, other]

    cs.LG cs.AI cs.CL

    Scaling Laws for Neural Machine Translation

    Authors: Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry

    Abstract: We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accu…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 31 pages, 23 figures

  11. arXiv:2006.13409  [pdf, other]

    stat.ML cs.LG math.ST

    When Do Neural Networks Outperform Kernel Methods?

    Authors: Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

    Abstract: For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothn…

    Submitted 9 November, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: 100 pages, 12 figures

    MSC Class: 62J99 (Primary)

  12. arXiv:1906.08899  [pdf, other]

    stat.ML cs.LG math.ST

    Limitations of Lazy Training of Two-layers Neural Networks

    Authors: Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

    Abstract: We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class label…

    Submitted 20 June, 2019; originally announced June 2019.

    Comments: 39 pages; 2 pdf figures

  13. arXiv:1904.12191  [pdf, other]

    math.ST cs.LG

    Linearized two-layers neural networks in high dimension

    Authors: Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari

    Abstract: We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizat…

    Submitted 16 February, 2020; v1 submitted 27 April, 2019; originally announced April 2019.

    Comments: 65 pages; 17 pdf figures

  14. arXiv:1901.10159  [pdf, other]

    cs.LG stat.ML

    An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

    Authors: Behrooz Ghorbani, Shankar Krishnan, Ying Xiao

    Abstract: To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized ne…

    Submitted 29 January, 2019; originally announced January 2019.

    Comments: 21 pages, 19 figures

  15. arXiv:1810.07403  [pdf, ps, other]

    math.ST stat.ME

    Optimal Covariance Estimation for Condition Number Loss in the Spiked Model

    Authors: David L. Donoho, Behrooz Ghorbani

    Abstract: We study estimation of the covariance matrix under relative condition number loss $κ(Σ^{-1/2} \hatΣ Σ^{-1/2})$, where $κ(Δ)$ is the condition number of matrix $Δ$, and $\hatΣ$ and $Σ$ are the estimated and theoretical covariance matrices. Optimality in $κ$-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We…

    Submitted 17 October, 2018; originally announced October 2018.

    Comments: 85 pages, 4 figures

  16. arXiv:1802.00568  [pdf, other]

    stat.ML

    An Instability in Variational Inference for Topic Models

    Authors: Behrooz Ghorbani, Hamid Javadi, Andrea Montanari

    Abstract: Topic models are Bayesian models that are frequently used to capture the latent structure of certain corpora of documents or images. Each data element in such a corpus (for instance each item in a collection of scientific articles) is regarded as a convex combination of a small number of vectors corresponding to 'topics' or 'components'. The weights are assumed to have a Dirichlet prior distributi…

    Submitted 2 February, 2018; originally announced February 2018.

    Comments: 69 pages; 18 pdf figures

  17. arXiv:1504.00984  [pdf, other]

    cs.IT

    Sparse regression with highly correlated predictors

    Authors: Behrooz Ghorbani, Ozgur Yilmaz

    Abstract: We consider a linear regression $y=Xβ+u$ where $X\in\mathbb{R}^{n\times p}$, $p\gg n,$ and $β$ is $s$-sparse. Motivated by examples in financial and economic data, we consider the situation where $X$ has highly correlated and clustered columns. To perform sparse recovery in this setting, we introduce the \emph{clustering removal algorithm} (CRA), that seeks to decrease the correlation in…

    Submitted 4 April, 2015; originally announced April 2015.