Non-Vacuous Generalization Bounds for Large Language Models

Authors: Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, Andrew Gordon Wilson

Abstract: Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply regurgitate their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation on massive datasets. To achieve the extreme level of compression required for non-vacuous generalization bounds, we devise SubLoRA, a low-dimensional non-linear parameterization. Using this approach, we find that larger models have better generalization bounds and are more compressible than smaller models.

Submitted 12 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2211.13609

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Authors: Sanae Lotfi, Marc Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, Andrew Gordon Wilson

Abstract: While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization.

Submitted 24 November, 2022; originally announced November 2022.

Comments: NeurIPS 2022. Code is available at https://github.com/activatedgeek/tight-pac-bayes

arXiv:2209.03147

Network Intrusion Detection with Limited Labeled Data Using Self-supervision

Authors: S. Lotfi, M. Modirrousta, S. Shashaani, M. Aliyari Shoorehdeli

Abstract: With the increasing dependency of daily life over computer networks, the importance of these networks security becomes prominent. Different intrusion attacks to networks have been designed and the attackers are working on improving them. Thus the ability to detect intrusion with limited number of labeled data is desirable to provide networks with higher level of security. In this paper we design an intrusion detection system based on a deep neural network. The proposed system is based on self-supervised contrastive learning where a huge amount of unlabeled data can be used to generate informative representation suitable for various downstream tasks with limited number of labeled data. Using different experiments, we have shown that the proposed system presents an accuracy of 94.05% over the UNSW-NB15 dataset, an improvement of 4.22% in comparison to previous method based on self-supervised learning. Our simulations have also shown impressive results when the size of labeled training data is limited. The performance of the resulting Encoder Block trained on UNSW-NB15 dataset has also been tested on other datasets for representation extraction which shows competitive results in downstream tasks.

Submitted 31 March, 2023; v1 submitted 1 September, 2022; originally announced September 2022.

arXiv:2204.12314

doi 10.1103/PhysRevD.105.095024

Gravitational Effects on Quantum Coherence in Neutrino Oscillation

Authors: M. M. Ettefaghi, R. Ramezani Arani, Z. S. Tabatabaei Lotfi

Abstract: In this paper, we investigate the quantum coherence for two flavor neutrinos propagating in a Schwarzschild metric. In fact, this issue is explored both qualitatively via calculating the parameter $K_{3}$ in Leggett-Garg inequality (LGI) and also quantitatively by evaluating the $l_{1}$-norm, ${\cal C}(ρ)$. Using the weak field approximations, we show that the gravitational effects decrease the maximum value of $K_{3}$ for some intervals of energy such a way that there is no violation, while it leaves the maximum amount of the quantum coherence, ${\cal C}(ρ)$ unchanged.

Submitted 4 May, 2022; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: 21 pages, 2 figures, to appear in Phys. Rev. D

arXiv:2202.11678

Bayesian Model Selection, the Marginal Likelihood, and Generalization

Authors: Sanae Lotfi, Pavel Izmailov, Gregory Benton, Micah Goldblum, Andrew Gordon Wilson

Abstract: How do we compare between hypotheses that are entirely consistent with observations? The marginal likelihood (aka Bayesian evidence), which represents the probability of generating our observations from a prior, provides a distinctive approach to this foundational question, automatically encoding Occam's razor. Although it has been observed that the marginal likelihood can overfit and is sensitive to prior assumptions, its limitations for hyperparameter learning and discrete model comparison have not been thoroughly investigated. We first revisit the appealing properties of the marginal likelihood for learning constraints and hypothesis testing. We then highlight the conceptual and practical issues in using the marginal likelihood as a proxy for generalization. Namely, we show how marginal likelihood can be negatively correlated with generalization, with implications for neural architecture search, and can lead to both underfitting and overfitting in hyperparameter learning. We also re-examine the connection between the marginal likelihood and PAC-Bayes bounds and use this connection to further elucidate the shortcomings of the marginal likelihood for model selection. We provide a partial remedy through a conditional marginal likelihood, which we show is more aligned with generalization, and practically valuable for large-scale hyperparameter learning, such as in deep kernel learning.

Submitted 1 May, 2023; v1 submitted 23 February, 2022; originally announced February 2022.

Comments: Extended version. Shorter ICML version available at arXiv:2202.11678v2

arXiv:2111.14761

Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning

Authors: Sanae Lotfi, Tiphaine Bonniot de Ruisselet, Dominique Orban, Andrea Lodi

Abstract: In this paper, we consider both first- and second-order techniques to address continuous optimization problems arising in machine learning. In the first-order case, we propose a framework of transition from deterministic or semi-deterministic to stochastic quadratic regularization methods. We leverage the two-phase nature of stochastic optimization to propose a novel first-order algorithm with adaptive sampling and adaptive step size. In the second-order case, we propose a novel stochastic damped L-BFGS method that improves on previous algorithms in the highly nonconvex context of deep learning. Both algorithms are evaluated on well-known deep learning datasets and exhibit promising performance.

Submitted 29 November, 2021; originally announced November 2021.

Comments: 29 pages, 8 figures. arXiv admin note: text overlap with arXiv:2012.05783

MSC Class: 68T07; 90C15; 90C30; 90C53

ACM Class: G.1.6; G.3; G.4; I.2.6

arXiv:2109.10430

GAP2WSS: A Genetic Algorithm based on the Pareto Principle for Web Service Selection

Authors: SayedHassan Khatoonabadi, Shahriar Lotfi, Ayaz Isazadeh

Abstract: Despite all the progress in Web service selection, the need for an approach with a better optimality and performance still remains. This paper presents a genetic algorithm by adopting the Pareto principle that is called GAP2WSS for selecting a Web service for each task of a composite Web service from a pool of candidate Web services. In contrast to the existing approaches, all global QoS constraints, interservice constraints, and transactional constraints are considered simultaneously. At first, all candidate Web services are scored and ranked per each task using the proposed mechanism. Then, the top 20 percent of the candidate Web services of each task are considered as the candidate Web services of the corresponding task to reduce the problem search space. Finally, the Web service selection problem is solved by focusing only on these 20 percent candidate Web services of each task using a genetic algorithm. Empirical studies demonstrate this approach leads to a higher efficiency and efficacy as compared with the case that all the candidate Web services are considered in solving the problem.

Submitted 21 September, 2021; originally announced September 2021.

arXiv:2106.11905

Dangers of Bayesian Model Averaging under Covariate Shift

Authors: Pavel Izmailov, Patrick Nicholson, Sanae Lotfi, Andrew Gordon Wilson

Abstract: Approximate Bayesian inference for neural networks is considered a robust alternative to standard training, often providing good performance on out-of-distribution data. However, Bayesian neural networks (BNNs) with high-fidelity approximate inference via full-batch Hamiltonian Monte Carlo achieve poor generalization under covariate shift, even underperforming classical estimation. We explain this surprising result, showing how a Bayesian model average can in fact be problematic under covariate shift, particularly in cases where linear dependencies in the input features cause a lack of posterior contraction. We additionally show why the same issue does not affect many approximate inference procedures, or classical maximum a-posteriori (MAP) training. Finally, we propose novel priors that improve the robustness of BNNs to many sources of covariate shift.

Submitted 6 December, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021. Code is available at https://github.com/izmailovpavel/bnn_covariate_shift

arXiv:2102.13042

Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling

Authors: Gregory W. Benton, Wesley J. Maddox, Sanae Lotfi, Andrew Gordon Wilson

Abstract: With a better understanding of the loss surfaces for multilayer networks, we can build more robust and accurate training procedures. Recently it was discovered that independently trained SGD solutions can be connected along one-dimensional paths of near-constant training loss. In this paper, we show that there are mode-connecting simplicial complexes that form multi-dimensional manifolds of low loss, connecting many independently trained models. Inspired by this discovery, we show how to efficiently build simplicial complexes for fast ensembling, outperforming independently trained deep ensembles in accuracy, calibration, and robustness to dataset shift. Notably, our approach only requires a few training epochs to discover a low-loss simplex, starting from a pre-trained solution. Code is available at https://github.com/g-benton/loss-surface-simplexes.

Submitted 15 November, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2012.05783

doi 10.13140/RG.2.2.27851.41765/1

Stochastic Damped L-BFGS with Controlled Norm of the Hessian Approximation

Authors: Sanae Lotfi, Tiphaine Bonniot de Ruisselet, Dominique Orban, Andrea Lodi

Abstract: We propose a new stochastic variance-reduced damped L-BFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS. We establish almost sure convergence to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGS-VR and SVRG on a modified DavidNet problem -- a highly nonconvex and ill-conditioned problem that arises in the context of deep learning, and their performance is comparable on a logistic regression problem and a nonconvex support-vector machine problem.

Submitted 10 December, 2020; originally announced December 2020.

Comments: 14 pages, 4 figures

Report number: Cahier du GERAD G-2020-52

MSC Class: 68T07; 90C15; 90C30; 90C53

ACM Class: G.1.6; G.3; G.4; I.2.6

arXiv:2011.13010

doi 10.1209/0295-5075/132/31002

Quantum Correlations in Neutrino Oscillation: Coherence and Entanglement

Authors: M. M. Ettefaghi, Z. S. Tabatabaei Lotfi, R. Ramezani Arani

Abstract: In this paper, we consider the quantum correlations, coherence and entanglement, in neutrino oscillation. We find that the $l_{1}$-norm as a coherence measure is equal to sum of the three possible concurrences for measuring the entanglement among different flavor modes which were calculated in the paper by (M. Blasone et al., Europhys. Lett., {\bf 112}, 20007). Our result shows that the origin of the flavor entanglement in neutrino oscillation is the same as that of quantum coherence. Furthermore, in the wave packet framework, the variation of $l_{1}$-norm is investigated by varying the wave packet width $σ_{x}$. As it is expected the amount of coherence increases by $σ_{x}$ due to the increase in the overlapping of the mass eigenstates.

Submitted 25 November, 2020; originally announced November 2020.

Comments: 12 pages; 1 figure

arXiv:1308.3784

doi 10.5121/ijfcst.2013.3401

Graph Colouring Problem Based on Discrete Imperialist Competitive Algorithm

Authors: Hojjat Emami, Shahriar Lotfi

Abstract: In graph theory, Graph Colouring Problem (GCP) is an assignment of colours to vertices of any given graph such that the colours on adjacent vertices are different. The GCP is known to be an optimization and NP-hard problem. Imperialist Competitive Algorithm (ICA) is a meta-heuristic optimization and stochastic search strategy which is inspired from socio-political phenomenon of imperialistic competition. The ICA contains two main operators: the assimilation and the imperialistic competition. The ICA has excellent capabilities such as high convergence rate and better global optimum achievement. In this research, a discrete version of ICA is proposed to deal with the solution of GCP. We call this algorithm as the DICA. The performance of the proposed method is compared with Genetic Algorithm (GA) on seven well-known graph colouring benchmarks. Experimental results demonstrate the superiority of the DICA for the benchmarks. This means DICA can produce optimal and valid solutions for different GCP instances.

Submitted 17 August, 2013; originally announced August 2013.

Comments: 12 pages

Journal ref: International Journal in Foundations of Computer Science & Technology (IJFCST), Vol. 3, No.4, July 2013, pp. 1-12

Showing 1–12 of 12 results for author: Lotfi, S