Search | arXiv e-print repository

Quantization of Large Language Models with an Overdetermined Basis

Authors: Daniil Merkulov, Daria Cherniuk, Alexander Rudikov, Ivan Oseledets, Ekaterina Muravleva, Aleksandr Mikhalev, Boris Kashin

Abstract: In this paper, we introduce an algorithm for data quantization based on the principles of Kashin representation. This approach hinges on decomposing any given vector, matrix, or tensor into two factors. The first factor maintains a small infinity norm, while the second exhibits a similarly constrained norm when multiplied by an orthogonal matrix. Surprisingly, the entries of factors after decomposition are well-concentrated around several peaks, which allows us to efficiently replace them with corresponding centroids for quantization purposes. We study the theoretical properties of the proposed approach and rigorously evaluate our compression algorithm in the context of next-word prediction tasks and on a set of downstream tasks for text classification. Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance while ensuring data compression, marking a significant advancement in the field of data quantization.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2209.14937

NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer

Authors: Valentin Leplat, Daniil Merkulov, Aleksandr Katrutsa, Daniel Bershatsky, Olga Tsymboi, Ivan Oseledets

Abstract: Classical machine learning models such as deep neural networks are usually trained using Stochastic Gradient Descent (SGD)-based algorithms. Classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the resulting method, referred to as NAG-GS, are first studied extensively for the minimization of a quadratic function. This analysis allows us to derive a learning rate that is optimal in terms of the convergence rate while ensuring the stability of NAG-GS. This is achieved by a careful analysis of the spectral radius of the iteration matrix and of the covariance matrix at stationarity with respect to all hyperparameters of our method. Further, we show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for training machine learning models such as logistic regression, residual networks on standard computer vision datasets, Transformers on the GLUE benchmark, and the recent Vision Transformers.

Submitted 30 September, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: We study Nesterov acceleration for the Stochastic Differential Equation

arXiv:2201.13195

Memory-Efficient Backpropagation through Large Linear Layers

Authors: Daniel Bershatsky, Aleksandr Mikhalev, Alexandr Katrutsa, Julia Gusak, Daniil Merkulov, Ivan Oseledets

Abstract: In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass. This study proposes a memory reduction approach to perform backpropagation through linear layers. Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplication and demonstrate that they require less memory with a moderate decrease in test accuracy. We also investigate the variance of the gradient estimate induced by the randomized matrix multiplication and compare it with the variance coming from gradient estimation based on a batch of samples. We demonstrate the benefits of the proposed method on fine-tuning the pre-trained RoBERTa model on GLUE tasks.

Submitted 2 February, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Comments: Submitted

arXiv:2110.00874

Fast Line Search for Multi-Task Learning

Authors: Andrey Filatov, Daniil Merkulov

Abstract: Multi-task learning is a powerful method for solving several tasks jointly by learning a robust representation. Optimizing a multi-task learning model is more complex than optimizing a single-task model because of task conflict. Theoretical results guarantee convergence to the optimal point when the step size is chosen through line search. In practice, however, line search is usually not the best choice because of its large computational overhead. We propose a novel idea for line search algorithms in multi-task learning: use the latent representation space instead of the parameter space to find the step size. We examine this idea with backtracking line search. We compare this fast backtracking algorithm with classical backtracking and with gradient methods with a constant learning rate on MNIST, CIFAR-10, and Cityscapes tasks. The systematic empirical study shows that the proposed method leads to a more accurate and faster solution than the traditional backtracking approach, while keeping computational time and performance competitive with the constant learning rate method.

Submitted 2 October, 2021; originally announced October 2021.

arXiv:2007.06937

Follow the bisector: a simple method for multi-objective optimization

Authors: Alexandr Katrutsa, Daniil Merkulov, Nurislam Tursynbek, Ivan Oseledets

Abstract: This study presents a novel Equiangular Direction Method (EDM) to solve a multi-objective optimization problem. We consider optimization problems in which multiple differentiable losses have to be minimized. The presented method computes a descent direction in every iteration that guarantees an equal relative decrease of the objective functions. This descent direction is based on the normalized gradients of the individual losses. Therefore, it is appropriate for solving multi-objective optimization problems with multi-scale losses. We test the proposed method on the imbalanced classification problem and the multi-task learning problem, using standard datasets. EDM is compared with other methods for solving these problems.

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2004.08981

Stochastic gradient algorithms from ODE splitting perspective

Authors: Daniil Merkulov, Ivan Oseledets

Abstract: We present a different view on stochastic optimization, which goes back to splitting schemes for the approximate solution of ODEs. In this work, we provide a connection between the stochastic gradient descent approach and a first-order splitting scheme for ODEs. We consider a special case of splitting, inspired by machine learning applications, and derive a new upper bound on the global splitting error for it. We show that the Kaczmarz method is the limit case of the splitting scheme for unit-batch SGD on the linear least squares problem. We support our findings with systematic empirical studies, which demonstrate that a more accurate solution of local problems leads to stepsize robustness and provides better convergence in time and iterations on the softmax regression problem.

Submitted 19 April, 2020; originally announced April 2020.

arXiv:1906.06295

doi 10.1134/S1064226919120118

Empirical study of extreme overfitting points of neural networks

Authors: Daniil Merkulov, Ivan Oseledets

Abstract: In this paper we propose a method of obtaining points of extreme overfitting, that is, parameters of modern neural networks at which they demonstrate close to 100% training accuracy together with almost zero accuracy on the test sample. Despite the widespread opinion that the overwhelming majority of critical points of the loss function of a neural network have equally good generalizing ability, such points have a huge generalization error. The paper studies the properties of such points and their location on the surface of the loss function of modern neural networks.

Submitted 3 July, 2019; v1 submitted 14 June, 2019; originally announced June 2019.

Journal ref: J. Commun. Technol. Electron. 64, 1527-1534 (2019)

Showing 1–7 of 7 results for author: Merkulov, D