-
AdaFisher: Adaptive Second Order Optimization via Fisher Information
Authors:
Damien Martins Gomes,
Yanlei Zhang,
Eugene Belilovsky,
Guy Wolf,
Mahdi S. Hosseini
Abstract:
First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by applying diagonal matrix preconditioning to the stochastic gradient during training. Although first-order methods are widespread, second-order optimization algorithms exhibit superior convergence properties compared to first-order counterparts such as Adam and SGD. However, their practicality in training DNNs is still limited due to increased per-iteration computation and suboptimal accuracy compared to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in a second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we show that AdaFisher can be reliably adopted for image classification and language modelling, and that it stands out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available at https://github.com/AtlasAnalyticsLab/AdaFisher.
Submitted 25 May, 2024;
originally announced May 2024.
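To make the phrase "diagonal matrix preconditioning of the stochastic gradient" concrete, here is a minimal sketch of one standard Adam update on plain Python lists. It is not AdaFisher, whose block-diagonal Fisher approximation is not specified in this abstract; the function name `adam_step` and the default hyperparameters are illustrative assumptions.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists; returns the new (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate (one value per coordinate)
        m_hat = mi / (1 - beta1 ** t)            # bias correction; t starts at 1
        v_hat = vi / (1 - beta2 ** t)
        # Diagonal preconditioning: each coordinate is scaled by 1 / (sqrt(v_hat) + eps).
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```

Because the scaling is applied per coordinate, the preconditioner is a diagonal matrix and carries no information about interactions between parameters, which is the limitation the abstract contrasts with second-order methods.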
-
On Reversing Operator Choi-Davis-Jensen inequality
Authors:
Seyyed Saeid Hashemi Karouei,
Mohammad Sadegh Asgari,
Mohsen Shah Hosseini
Abstract:
In this paper, we first provide a better estimate of the second inequality in the Hermite-Hadamard inequality. Next, we study the reverse of the celebrated Davis-Choi-Jensen inequality. Our results are employed to establish a new bound for the operator Kantorovich inequality.
Submitted 6 April, 2021;
originally announced April 2021.
-
AdaS: Adaptive Scheduling of Stochastic Gradients
Authors:
Mahdi S. Hosseini,
Konstantinos N. Plataniotis
Abstract:
The step-size used in Stochastic Gradient Descent (SGD) optimization is selected empirically in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience, offers limited insight into how the parameters update, and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely "how much knowledge is gained in iterative training of deep neural networks?" Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training. Code is available at https://github.com/mahdihosseini/AdaS.
Submitted 11 June, 2020;
originally announced June 2020.
-
On the Closed Form Expression of Elementary Symmetric Polynomials and the Inverse of Vandermonde Matrix
Authors:
Mahdi S. Hosseini,
Alfred Chen,
Konstantinos N. Plataniotis
Abstract:
Inverse Vandermonde matrix calculation is a long-standing problem for solving the nonsingular linear system $Vc=b$, where the rows of a square matrix $V$ are constructed by progression of the power polynomials. It has many applications in scientific computing, including interpolation, super-resolution, and construction of special matrices applied in cryptography. Despite its numerous applications, the matrix is highly ill-conditioned, and specialized treatments are considered for approximation, such as conversion to a Cauchy matrix, spectral decomposition, and algorithmic tailoring of the numerical solutions. In this paper, we propose a generalized algorithm that takes arbitrary pairwise distinct (non-repeating) sample nodes for solving the inverse Vandermonde matrix. This is done in two steps: first, a highly balanced recursive algorithm is introduced with $\mathcal{O}(N)$ complexity to solve the combinatorial summation of the elementary symmetric polynomials; and second, a closed-form solution is tailored for the inverse Vandermonde matrix, whose elements utilize this recursive summation for the inverse calculations. The numerical stability and accuracy of the proposed inverse method are analyzed through the spectral decomposition of the Frobenius companion matrix associated with the corresponding Vandermonde matrix. The results show significant improvement over the state-of-the-art solutions using specific nodes such as the $N$th roots of unity defined on the complex plane. A basic application to a one-dimensional interpolation problem is considered to demonstrate the utility of the proposed method for super-resolved signals.
Submitted 17 September, 2019;
originally announced September 2019.
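The link between elementary symmetric polynomials and the inverse Vandermonde matrix can be sketched with the textbook Lagrange-basis identity: for $V_{ij} = x_i^j$, entry $(j, i)$ of $V^{-1}$ is $(-1)^{n-1-j} e_{n-1-j}(x \setminus x_i) / \prod_{m \neq i} (x_i - x_m)$. The code below is a naive exact-arithmetic illustration of that identity, not the paper's balanced recursive algorithm; the function names are assumptions made for this example.

```python
from fractions import Fraction

def elementary_symmetric(values):
    """Return [e_0, e_1, ..., e_m] of the given values via the standard recurrence."""
    e = [Fraction(1)] + [Fraction(0)] * len(values)
    for count, v in enumerate(values, start=1):
        # Update from high to low degree so each value contributes exactly once.
        for k in range(count, 0, -1):
            e[k] += v * e[k - 1]
    return e

def vandermonde_inverse(nodes):
    """Inverse of V with V[i][j] = nodes[i] ** j, for pairwise distinct nodes."""
    n = len(nodes)
    nodes = [Fraction(x) for x in nodes]
    inv = [[Fraction(0)] * n for _ in range(n)]
    for i, xi in enumerate(nodes):
        others = nodes[:i] + nodes[i + 1:]
        e = elementary_symmetric(others)
        denom = Fraction(1)
        for xm in others:
            denom *= xi - xm  # zero only if two nodes coincide
        for j in range(n):
            k = n - 1 - j
            # Coefficient of t**j in the i-th Lagrange basis polynomial.
            inv[j][i] = (-1) ** k * e[k] / denom
    return inv
```

Using `Fraction` sidesteps the ill-conditioning mentioned in the abstract so that the identity can be checked exactly; a floating-point version would need the stability analysis the paper describes.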
-
On the Operator Jensen Inequality for Convex Functions
Authors:
M. Shah Hosseini,
H. R. Moradi,
B. Moosavi
Abstract:
This paper is mainly devoted to studying the operator Jensen inequality. More precisely, a new generalization of the Jensen inequality and its reverse version for convex (not necessarily operator convex) functions are proved. Several special cases are discussed as well.
Submitted 7 June, 2019; v1 submitted 30 May, 2019;
originally announced May 2019.
-
An Alternative Estimate for the Numerical Radius of Hilbert Space Operators
Authors:
M. Shah Hosseini,
B. Moosavi,
H. R. Moradi
Abstract:
We give an alternative lower bound for the numerical radii of Hilbert space operators. As a by-product, we find conditions such that
\begin{equation*} \omega\left(\left[\begin{array}{cc} 0 & R \\ S & 0 \end{array}\right]\right)=\frac{\Vert R \Vert +\Vert S\Vert }{2} \end{equation*} where $R, S \in \mathbb{B}(\mathcal{H})$.
Submitted 27 March, 2019; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Finite Differences in Forward and Inverse Imaging Problems--MaxPol Design
Authors:
Mahdi S. Hosseini,
Konstantinos N. Plataniotis
Abstract:
A systematic and comprehensive framework for finite impulse response (FIR) lowpass/fullband derivative kernels is introduced in this paper. Closed-form solutions of a number of derivative filters are obtained using the maximally flat technique to regulate the Fourier response of undetermined coefficients. The framework includes arbitrary parameter control methods that afford solutions for numerous differential orders, variable polynomial accuracy, centralized/staggered schemes, and arbitrary side-shift nodes for boundary formulation. Using the proposed framework, four different derivative matrix operators are introduced, and their numerical stability is analyzed by studying their eigenvalue distribution in the complex plane. Their utility is studied by considering two important image processing problems, namely gradient surface reconstruction and image stitching. Experimentation indicates that the new derivative matrices not only outperform commonly used methods but also provide useful insights into the numerical issues in these two applications.
Submitted 25 September, 2017;
originally announced September 2017.
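As a rough illustration of "variable polynomial accuracy" for centralized derivative kernels, the sketch below derives first-derivative FIR weights from Taylor moment conditions by solving a small linear system exactly. This is the textbook undetermined-coefficients approach, not the MaxPol closed-form solution; the function names and the unit grid spacing are assumptions.

```python
from fractions import Fraction

def solve_linear_system(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        # Pick a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Normalize the pivot row, then clear this column in every other row.
        p = m[col][col]
        m[col] = [val / p for val in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [val - f * w for val, w in zip(m[r], m[col])]
    return [row[n] for row in m]

def centered_derivative_weights(half_width):
    """First-derivative weights on the stencil -half_width..half_width (unit spacing)."""
    offsets = range(-half_width, half_width + 1)
    size = 2 * half_width + 1
    # Row p enforces sum_k c_k * k**p = 1 if p == 1 else 0 (Taylor moment conditions).
    a = [[Fraction(k) ** p for k in offsets] for p in range(size)]
    b = [Fraction(1 if p == 1 else 0) for p in range(size)]
    return solve_linear_system(a, b)
```

Calling `centered_derivative_weights(1)` gives the familiar central difference, and larger `half_width` values raise the polynomial order that the stencil differentiates exactly.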
-
High-Accuracy Total Variation for Compressed Video Sensing
Authors:
Mahdi S. Hosseini,
Konstantinos N. Plataniotis
Abstract:
Numerous total variation (TV) regularizers used in image restoration problems encode the gradients by means of a simple $[-1,1]$ FIR filter. Despite its low computational cost, this filter severely distorts the signal's high-frequency components pertinent to edge/discontinuity information and causes several deficiencies known as texture and geometric loss. This paper addresses this problem by proposing an alternative model to the TV regularization problem via high-order-accuracy differential FIR filters to preserve rapid transitions in signal recovery. A numerical encoding scheme is designed to extend the TV model into a multidimensional representation (tensorial decomposition). We adopt this design to regulate the spatial and temporal redundancy in the compressed video sensing problem to jointly recover frames from under-sampled measurements. We then seek the solution via the alternating direction method of multipliers and find a unique solution to the quadratic minimization step that can handle different boundary conditions. The resulting algorithm uses a much lower sampling rate and substantially outperforms alternative state-of-the-art methods, as evaluated in terms of both restoration accuracy and visual quality of the recovered frames.
Submitted 4 March, 2014; v1 submitted 1 September, 2013;
originally announced September 2013.
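The high-frequency deviation of the $[-1,1]$ filter can be seen from its magnitude response: the forward difference $y[n] = x[n+1] - x[n]$ has gain $|e^{i\omega} - 1| = 2|\sin(\omega/2)|$, whereas an ideal derivative has gain $|\omega|$. The sketch below computes a 1D TV value and compares the two gains; it covers only the baseline filter, not the paper's high-order filters, and the function names are assumptions.

```python
import cmath
import math

def total_variation_1d(signal):
    """1D TV: apply the [-1, 1] filter, then sum absolute values."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

def forward_difference_gain(omega):
    """Magnitude response of y[n] = x[n+1] - x[n] at angular frequency omega."""
    return abs(cmath.exp(1j * omega) - 1)

for omega in (0.1, math.pi / 2, math.pi):
    # An ideal derivative has gain |omega|; the gap widens toward omega = pi.
    gain = forward_difference_gain(omega)
    print(f"omega={omega:.3f}  [-1,1] gain={gain:.3f}  ideal={omega:.3f}")
```

At low frequencies the two gains nearly coincide, while at the Nyquist frequency the $[-1,1]$ filter reaches only 2 against an ideal $\pi$, which is the under-weighting of edge content that the abstract refers to.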