License: arXiv.org perpetual non-exclusive license
arXiv:2403.11041v1 [cs.LG] 16 Mar 2024

FAGH: Accelerating Federated Learning with Approximated Global Hessian

Mrinmay Sen$^{1,2}$    A. K. Qin$^{1}$    Krishna Mohan C$^{2}$
$^{1}$Dept. of Computing Technologies, Swinburne University of Technology, Hawthorn, VIC, Australia
$^{2}$Dept. of Artificial Intelligence, Indian Institute of Technology Hyderabad, Hyderabad, India
Abstract

In federated learning (FL), the slow convergence of global model training causes significant communication overhead: a large number of communication rounds is required to reach convergence. One potential solution is to train with a Newton-based optimization method, known for its quadratic convergence rate. However, existing Newton-based FL training methods suffer from either memory inefficiency or high computational costs for local clients or the server. To address this issue, we propose FL with approximated global Hessian (FAGH), a method to accelerate FL training. FAGH leverages the first moment of the approximated global Hessian and the first moment of the global gradient to train the global model. By harnessing the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, reducing the number of communication rounds and thus shortening training time. Experimental results verify FAGH’s effectiveness in decreasing the number of communication rounds and the time required to achieve pre-specified objectives of global model performance in terms of training and test losses as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods.

1 Introduction

In centralized learning, all data is collected in one place and used to train a machine learning model. Despite its high performance, centralized learning poses risks of privacy leakage and incurs communication overhead when collecting data from different sources or clients. These challenges motivate the transition from centralized learning to federated learning. In FedAvg McMahan et al. (2017), the most popular baseline algorithm of federated learning, locally trained models rather than raw data are collected on the server; the server aggregates all local models to find the global model, which is then sent back to all clients for further training. Sending local models instead of raw data to the server helps overcome data transfer challenges. One communication round of FedAvg involves two communications: sending the global model from the server to all available clients and sharing locally trained models with the server, resulting in a communication cost of $O(2d)$, where $d$ represents the number of model parameters. Since FedAvg employs a first-order stochastic gradient descent optimizer Ketkar and Ketkar (2017) to update local models, it is computationally efficient, with a local time complexity of $O(d)$. FedAvg performs well when data are homogeneously distributed across all clients Li et al. (2020c), a scenario that is rarely encountered in real-life applications. In cases of heterogeneous data distribution, FedAvg suffers from objective inconsistency Karimireddy et al. (2020); Li et al. (2020b, c); Tan et al. (2021); Wang et al. (2020b). This inconsistency occurs when the global model converges to a stationary point that may not be the optimum of the global objective function, resulting in slow training of the global model and an increase in both the number of communication rounds and the time required to achieve a certain performance level.
To accelerate FL training in scenarios of heterogeneous data distribution, several modifications have been proposed for FedAvg, including FedProx Li et al. (2020b), FedNova Wang et al. (2020b), SCAFFOLD Karimireddy et al. (2020), MOON Li et al. (2021), FedDC Gao et al. (2022), pFedMe Dinh et al. (2020), FedGA Dandi et al. (2022), FedExP Jhunjhunwala et al. (2023), among others. Although these modifications generally outperform FedAvg, the learning of the global model remains slower when aiming for a targeted performance, as these methods primarily utilize first-order gradient information for optimizing model parameters. Additionally, these methods are highly sensitive to hyperparameter choices.

To further accelerate FL training, researchers have shifted their attention from first-order optimization to the second-order Newton method of optimization due to its higher convergence rate compared to first-order methods Agarwal et al. (2017); Tankaria et al. (2021). Although the Newton method outperforms first-order optimization in terms of convergence, calculating and storing the Hessian and its inverse is challenging in large-scale settings: computing the Hessian and its inverse has time complexities of $O(d^{2})$ and $O(d^{3})$ respectively, and storing them has a space complexity of $O(d^{2})$. To address these challenges, researchers have focused on approximating the Hessian Agarwal et al. (2017); Liu and Nocedal (1989); Martens and Grosse (2015); Tankaria et al. (2021); Nazareth (2009); Vuchkov (2022) instead of using the true Hessian while optimizing model parameters. In federated learning, another issue arises when local models are updated using the Newton method. Since the Newton method utilizes the Hessian inverse for updating model parameters, averaging all locally trained models to find the global model is not feasible Derezinski and Mahoney (2019). State-of-the-art solutions to these problems can be categorized into three approaches. The first approach involves computing the local Hessian matrix and sending it to the server in a compressed form, where the server aggregates all local Hessian information to determine the global Newton direction Safaryan et al. (2022); Qian et al. (2022). The second approach is based on finding the local Newton direction with global gradient information Wang et al. (2018); Dinh et al. (2022).
The third approach utilizes the Quasi-Newton method of optimization Ma et al. (2022), where local first-order information is collected and aggregated on the server to find the global Newton direction. Existing Newton method-based federated optimizations Bischoff et al. (2021) include DANE Shamir et al. (2014), GIANT Wang et al. (2018), FedDANE Li et al. (2020a), FedSSO Ma et al. (2022), FedNL Safaryan et al. (2022), Basis Matters Qian et al. (2022), DONE Dinh et al. (2022), etc. These methods are either computationally expensive for local clients due to the calculation of the Hessian matrix, memory inefficient due to storing the Hessian matrix, or require four communications in each FL communication round.

To expedite FL training, this paper introduces FAGH, a Newton method-based federated learning algorithm that eliminates the need for four separate communications, as seen in approaches like DONE or GIANT. FAGH approximates the global Newton direction on the server without computing and storing the full Hessian matrix. In FAGH, the server collects gradients and the first row of the true Hessian from each local client. Utilizing the first moment of the average gradient and the first moment of the average first row of the true Hessian across all clients, the server determines the global Newton direction and updates the global model. By leveraging the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, resulting in a reduced number of communication rounds and shorter training times. Experimental results confirm FAGH’s effectiveness in reducing the required number of communication rounds to achieve predefined objectives for global model performance, including training and test losses, as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods. Since FAGH utilizes only the first row of the true Hessian when determining the Newton direction, it can significantly reduce local time and space complexities compared to existing second-order FL algorithms.

The main contributions of FAGH are as follows.

  • In FAGH, each client computes the gradient and the first row of the Hessian of its local loss function and sends these to the server. The server computes the first moments of the global gradient and of the first row of the global Hessian.

  • FAGH finds the global Newton direction directly from these first moments of the global gradient and of the first row of the global Hessian, without computing or storing the full global Hessian matrix on the server.

  • Using this directly computed global Newton direction leads to faster training in federated learning with linear computational time and space complexities.

The rest of the paper is arranged as follows. Section 2 reviews related work, Section 3 presents the problem formulation, Section 4 gives basic preliminaries, Section 5 discusses our proposed method, Section 6 presents the experimental setup and results, and Section 7 concludes the paper.

2 Related Work

The related works can be classified into two categories: first-order based and second-order based FL approaches.

Existing first-order approaches include FedProx, FedNova, SCAFFOLD, MOON, FedDC, pFedMe, FedGA, FedExP, etc. In FedProx, the local objective (loss) function is modified by incorporating a proximal term (with coefficient $\mu$), which helps to control the direction of the local gradient. Instead of using a simple or weighted average on the server, FedNova uses normalized averaging to obtain the global model. To control the drastic fluctuation of local gradients caused by data heterogeneity, SCAFFOLD uses variance reduction while updating the local model. To correct the local training, MOON conducts model-level contrastive learning, which utilizes the similarity between model representations. To control the update of the local model, FedDC uses an auxiliary local drift variable. pFedMe uses a loss function regularized with Moreau envelopes. FedGA finds the displacement of the local gradient with respect to the global gradient and uses it while initializing local models. To speed up FL training, FedExP adaptively finds the server step size (learning rate) by using the extrapolation mechanism of the Projection Onto Convex Sets (POCS) algorithm.

Existing second-order approaches include DANE, GIANT, FedDANE, FedSSO, FedNL, Basis Matters, DONE, etc. FedSSO applies a server-based Quasi-Newton method to the global gradient (the average gradient across all clients) to find the global Newton direction. FedSSO has the same local time complexity as FedAvg ($O(d)$), but it stores the full Hessian matrix on the server; the resulting server space complexity of $O(d^{2})$ may not be practical in large-scale settings. DANE, GIANT and FedDANE utilize the conjugate gradient method to approximate the Hessian, and DONE uses Richardson iteration. DANE, GIANT, FedDANE and DONE require the global gradient communicated by the server while finding the local Newton direction, which increases the total time for one FL iteration. One FL iteration of these methods consists of four separate communications: (1) sending the initial model from the server to the clients, (2) sending local gradients from the clients to the server, (3) sending the average gradient from the server to the clients, and (4) sending locally updated models from the clients to the server. The conjugate gradient method or Richardson iteration has a time complexity of $O(md+d)$, where $m$ is the number of conjugate gradient or Richardson iterations, which increases the local time complexity by a factor of about $m$ compared with FedAvg. FedNL and Basis Matters compute local Hessian information and send it to the server in a compressed form. They store the previous step’s Hessian matrix to approximate the current step’s Hessian. Storing, calculating and compressing the local Hessian places additional computational and memory load on the local clients.

3 Problem formulation

Let $C=\{C_{1},C_{2},\dots,C_{K}\}$ be the set of participating clients in federated learning, where $K$ is the number of clients, and let $D_{i}$ be the dataset owned by client $C_{i}$. The goal of federated learning is to find the optimum of the global objective function $F(w)$ over $w\in R^{d}$, as given in eq. 1.

$$\min_{w}F(w)=\sum^{K}_{i=1}p_{i}F_{i}(w;D_{i}) \tag{1}$$

where $w$ denotes the model parameters, $F_{i}(w)=\frac{1}{|D_{i}|}\sum_{\xi_{j}\in D_{i}}f_{j}(w;\xi_{j})$ is the average loss of client $C_{i}$ computed on dataset $D_{i}$, $f_{j}$ is the local loss for sample $\xi_{j}\in D_{i}$, and $\sum_{i=1}^{K}p_{i}=1$.
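As a minimal illustration of eq. 1, the sketch below evaluates the global objective as a weighted sum of per-client average losses. The squared-error per-sample loss, the toy datasets, and the weighting $p_{i}=|D_{i}|/\sum_{j}|D_{j}|$ are assumptions made for this example only; the formulation itself only requires that the weights sum to one.

```python
import numpy as np

def client_loss(w, X, y):
    """Average loss F_i(w; D_i) of one client (squared error is an assumed choice)."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def global_objective(w, datasets, weights):
    """F(w) = sum_i p_i * F_i(w; D_i), as in eq. 1."""
    return sum(p * client_loss(w, X, y) for p, (X, y) in zip(weights, datasets))

rng = np.random.default_rng(0)
d = 3
# Three hypothetical clients with different dataset sizes.
datasets = [(rng.normal(size=(n, d)), rng.normal(size=n)) for n in (5, 10, 20)]
sizes = np.array([len(y) for _, y in datasets], dtype=float)
weights = sizes / sizes.sum()  # assumed weighting; satisfies sum_i p_i = 1

w = np.zeros(d)
F_w = global_objective(w, datasets, weights)
```

With this particular (assumed) weighting, the global objective coincides with the average loss over the pooled data of all clients.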

4 Preliminaries

4.1 Newton method of optimization

The Newton method of optimization is similar to first-order stochastic gradient descent (SGD) Ketkar and Ketkar (2017); the only difference lies in how the update direction is found. In SGD, the gradient of the objective function is scaled by a learning rate parameter $\eta$ to find the update direction, as shown in eq. 2. In the Newton method (eq. 3), the update direction is found by scaling the gradient with the inverse of the true Hessian ($H$), which incorporates curvature information while searching for the optimum of the objective function. Using second-order Hessian curvature while optimizing model parameters leads to a quadratic convergence rate Agarwal et al. (2017), which motivates us to use the Newton method in FL to accelerate global model training.

$$w_{t}=w_{t-1}-\eta g_{t-1} \tag{2}$$
$$w_{t}=w_{t-1}-H^{-1}g_{t-1} \tag{3}$$
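To make the contrast between eq. 2 and eq. 3 concrete, the following sketch applies one step of each rule to a small convex quadratic $F(w)=\frac{1}{2}w^{T}Aw-b^{T}w$, for which the gradient is $Aw-b$ and the Hessian is $A$. The matrix $A$, the vector $b$, and the learning rate are illustrative assumptions. On such a quadratic a single Newton step reaches the minimizer, while the gradient step only moves part of the way.

```python
import numpy as np

# Assumed toy problem: F(w) = 0.5 * w^T A w - b^T w, so g = A w - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])
w_star = np.linalg.solve(A, b)          # exact minimizer

def gradient(w):
    return A @ w - b

w0 = np.zeros(2)
eta = 0.1  # assumed learning rate

# Eq. 2: gradient step scaled by the learning rate.
w_gd = w0 - eta * gradient(w0)

# Eq. 3: Newton step; solve H x = g instead of forming H^{-1} explicitly.
w_newton = w0 - np.linalg.solve(A, gradient(w0))

err_gd = np.linalg.norm(w_gd - w_star)
err_newton = np.linalg.norm(w_newton - w_star)
```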

4.2 Sherman Morrison formula of matrix inversion

The inverse of the matrix $(B+ZV^{T})\in R^{d\times d}$ can be calculated using the Sherman-Morrison formula, as shown in eq. 4.

$$(B+ZV^{T})^{-1}=B^{-1}-\frac{B^{-1}ZV^{T}B^{-1}}{1+V^{T}B^{-1}Z} \tag{4}$$

where $B\in R^{d\times d}$ is an invertible square matrix and $Z,V\in R^{d\times 1}$ are column vectors. The formula holds provided that $1+V^{T}B^{-1}Z\neq 0$.
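The identity in eq. 4 can be checked numerically. The sketch below compares its right-hand side with a direct inverse of $B+ZV^{T}$; the specific $B$, $Z$ and $V$ are assumed toy values, chosen so that the denominator is clearly nonzero.

```python
import numpy as np

# Assumed toy inputs: a diagonal (hence easily invertible) B and two column vectors.
B = np.diag([2.0, 4.0, 5.0, 10.0])
Z = np.array([[1.0], [2.0], [-1.0], [0.5]])
V = np.array([[0.5], [1.0], [2.0], [1.0]])

B_inv = np.diag(1.0 / np.diag(B))
denom = 1.0 + (V.T @ B_inv @ Z).item()  # eq. 4 requires this to be nonzero

# Right-hand side of eq. 4: a rank-one correction of B^{-1}.
sm_inverse = B_inv - (B_inv @ Z @ V.T @ B_inv) / denom

# Reference: invert B + Z V^T directly, which costs O(d^3) in general.
direct_inverse = np.linalg.inv(B + Z @ V.T)
max_abs_diff = np.max(np.abs(sm_inverse - direct_inverse))
```

When $B^{-1}$ is cheap to apply, as for a scaled identity, the product of the right-hand side with a vector needs only inner products and vector scalings, which is what makes the formula attractive for large $d$.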

Algorithm 1 FAGH
0:  Input: $T$: number of global epochs, $w_{0}$: initial global model, $\eta$: learning rate, $\rho$: Hessian regularization parameter, $\{\beta_{1},\beta_{2}\}\in[0,1)$: exponential decay rates for the moment estimates, $\{M_{1}^{0}\leftarrow 0, M_{2}^{0}\leftarrow 0\}$: initial moment vectors, initialized with zeros
1:  for $t=1$ to $T$ do
2:     Randomly pick a subset of clients $\overline{C}\subseteq C$
3:     Server sends global model $w_{t-1}$ to all the available clients $\overline{C}$
4:     In clients:
5:     for $i=1$ to $|\overline{C}|$ do
6:        Client $C_{i}$ finds local gradient $g_{i}^{t}=\frac{\partial F_{i}(w_{t-1},D_{i})}{\partial w_{t-1}}$ and first row of the true Hessian $v_{i}^{t}=\frac{\partial g_{i}^{t}[0]}{\partial w_{t-1}}$, where $g_{i}^{t}[0]$ is the first element of $g_{i}^{t}$
7:     end for
8:     In server:
9:     Collect $g_{i}^{t}$ and $v_{i}^{t}$ from all the clients
10:     Aggregate all $g_{i}^{t}$ to find the global (average) gradient $g^{t}=\sum p_{i}g_{i}^{t}$ and aggregate all $v_{i}^{t}$ to find the first row of the global (average) Hessian $v^{t}=\sum p_{i}v_{i}^{t}$ across all the clients
11:     Find $M_{1}^{t}=\beta_{1}M_{1}^{t-1}+(1-\beta_{1})g^{t}$
12:     Find $M_{2}^{t}=\beta_{2}M_{2}^{t-1}+(1-\beta_{2})v^{t}$
13:     Find $\widehat{M_{1}}^{t}=\frac{M_{1}^{t}}{1-\beta_{1}^{t}}$ and $\widehat{M_{2}}^{t}=\frac{M_{2}^{t}}{1-\beta_{2}^{t}}$, where $\beta_{j}^{t}$ is $\beta_{j}$ to the power $t$
14:     $G^{t}\leftarrow\widehat{M_{1}}^{t}\in R^{d\times 1}$
15:     $V^{t}\leftarrow\widehat{M_{2}}^{t}\in R^{d\times 1}$
16:     Find $Z^{t}=\frac{V^{t}}{V^{t}[0]}\in R^{d\times 1}$, where $V^{t}[0]$ is the first element of $V^{t}$
17:     Calculate update $(H_{a}+\rho I)^{-1}G^{t}=\frac{G^{t}}{\rho}-\frac{\frac{Z^{t}{V^{t}}^{T}G^{t}}{\rho^{2}}}{1+\frac{{V^{t}}^{T}Z^{t}}{\rho}}$
18:     Find $w_{t}=w_{t-1}-\eta(H_{a}+\rho I)^{-1}G^{t}$
19:  end for
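The server-side steps 10 to 18 of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `fagh_server_step`, the toy quadratic clients, and the hyperparameter values are assumptions. The update in step 17 is evaluated with inner products only, so no $d\times d$ matrix is formed.

```python
import numpy as np

def fagh_server_step(w, M1, M2, t, grads, rows, p, eta, rho, beta1, beta2):
    """One server update following steps 10-18 of Algorithm 1 (t starts at 1)."""
    g = sum(p_i * g_i for p_i, g_i in zip(p, grads))  # step 10: global gradient g^t
    v = sum(p_i * v_i for p_i, v_i in zip(p, rows))   # step 10: first Hessian row v^t

    M1 = beta1 * M1 + (1.0 - beta1) * g               # step 11
    M2 = beta2 * M2 + (1.0 - beta2) * v               # step 12
    G = M1 / (1.0 - beta1 ** t)                       # steps 13-14: bias correction
    V = M2 / (1.0 - beta2 ** t)                       # steps 13, 15

    Z = V / V[0]                                      # step 16; assumes V[0] != 0
    # Step 17: (Z V^T + rho I)^{-1} G via Sherman-Morrison with B = rho I.
    update = G / rho - (Z * (V @ G) / rho**2) / (1.0 + (V @ Z) / rho)
    return w - eta * update, M1, M2                   # step 18

# Hypothetical clients with quadratic losses 0.5 * w^T H_i w (assumed for the demo).
rng = np.random.default_rng(0)
d, K = 5, 3
w0 = rng.normal(size=d)
hessians = []
for _ in range(K):
    A = rng.normal(size=(d, d))
    hessians.append(A.T @ A + np.eye(d))  # symmetric positive definite, so H[0, 0] > 0
grads = [H @ w0 for H in hessians]        # local gradients at w0
rows = [H[0] for H in hessians]           # first rows of the local Hessians
p = np.full(K, 1.0 / K)

w1, M1, M2 = fagh_server_step(w0, np.zeros(d), np.zeros(d), 1, grads, rows, p,
                              eta=1.0, rho=0.1, beta1=0.9, beta2=0.99)
```

Each call costs $O(Kd)$ time for aggregation and $O(d)$ for the update itself, and only the two moment vectors are carried from round to round.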

5 Proposed Method

One communication round of FAGH is shown in Algorithm 1. In FAGH, at communication round $t$, the server first sends the global model $w_{t-1}$ to all the available clients. Each client $C_{i}$ uses $w_{t-1}$ as its initial model, finds the local gradient $g_{i}^{t}$ and the first row of the true Hessian $v_{i}^{t}$ with the help of its local data and local optimizer, and shares $g_{i}^{t}$ and $v_{i}^{t}$ with the server. The server then aggregates all the local gradients to find the global (average) gradient $g^{t}$ across all the clients, aggregates the first rows of all the local Hessians to find the first row of the global (average) Hessian $v^{t}$, and finds their first moments (exponential moving averages) $G^{t}$ and $V^{t}$ respectively.
The server utilizes $V^{t}$ to approximate the global Hessian $H_{a}$ and scales $G^{t}$ directly with the inverse of the regularized $H_{a}$ using the Sherman-Morrison formula for matrix inversion. The server then uses this scaled $G^{t}$ to find the global Newton direction and update the global model.
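On the client side, the first row of the Hessian can be obtained without building the full matrix. The sketch below does this for a binary logistic-regression loss, where both the gradient and the first Hessian row have closed forms. The choice of model, the function name, and the synthetic data are assumptions for illustration; for a general model the same row would be obtained by differentiating the first gradient entry with respect to $w$.

```python
import numpy as np

def client_gradient_and_first_hessian_row(w, X, y):
    """Gradient and first Hessian row of the mean logistic loss, both in O(n d)."""
    n = X.shape[0]
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
    g = X.T @ (probs - y) / n               # local gradient g_i
    s = probs * (1.0 - probs)               # per-sample curvature weights
    v = (X[:, 0] * s) @ X / n               # row 0 of X^T diag(s) X / n
    return g, v

# Synthetic local dataset (assumed for the demo).
rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < 0.5).astype(float)
w = rng.normal(size=d)

g, v = client_gradient_and_first_hessian_row(w, X, y)
```

The client then sends two vectors of length $d$ to the server instead of a $d\times d$ matrix.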

5.1 Hessian approximation with first row of the true Hessian

Let $w=\{x_{1},x_{2},\dots,x_{d}\}\in R^{d\times 1}$ be the model parameters to be optimized. We approximate the Hessian with the help of the first row of the true Hessian using Statement 1.

Statement 1 The Hessian of a twice differentiable loss function $F$ with respect to $w$ can be approximated by using eq. 5.

$$H_{a}=\frac{VU^{T}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}} \tag{5}$$

where $V=\{\frac{\partial^{2}F}{\partial x_{1}^{2}},\frac{\partial^{2}F}{\partial x_{1}\partial x_{2}},\frac{\partial^{2}F}{\partial x_{1}\partial x_{3}},\dots,\frac{\partial^{2}F}{\partial x_{1}\partial x_{d}}\}\in R^{d\times 1}$ is the first row of the true Hessian, and
$U=\{\frac{\partial^{2}F}{\partial x_{1}^{2}},\frac{\partial^{2}F}{\partial x_{2}\partial x_{1}},\frac{\partial^{2}F}{\partial x_{3}\partial x_{1}},\dots,\frac{\partial^{2}F}{\partial x_{d}\partial x_{1}}\}\in R^{d\times 1}$ is the first column of the true Hessian.
Proof of Statement 1: Let $H \in R^{d \times d}$ be the true Hessian of the loss function $F$. Then the $(i,j)^{th}$ element of $H$ is $H^{(i,j)} = \frac{\partial^2 F}{\partial x_i \partial x_j}$, where $i, j \in \{1, 2, 3, \ldots, d\}$.
Now, according to eq. 5, the $(i,j)^{th}$ element of the approximated Hessian ($H_a$) is $H_a^{(i,j)} = \frac{V^j \times U^i}{\frac{\partial^2 F}{\partial x_1^2}}$, where $V^j = \frac{\partial^2 F}{\partial x_1 \partial x_j}$ and $U^i = \frac{\partial^2 F}{\partial x_i \partial x_1}$ are the $j^{th}$ and $i^{th}$ elements of $V$ and $U$ respectively. So, we can write

$$H_a^{(i,j)} = \frac{V^j \times U^i}{\frac{\partial^2 F}{\partial x_1^2}} = \frac{\frac{\partial^2 F}{\partial x_1 \partial x_j} \times \frac{\partial^2 F}{\partial x_i \partial x_1}}{\frac{\partial^2 F}{\partial x_1^2}} = \frac{\frac{\partial (\frac{\partial F}{\partial x_1})}{\partial x_j} \times \frac{\partial (\frac{\partial F}{\partial x_i})}{\partial x_1}}{\frac{\partial (\frac{\partial F}{\partial x_1})}{\partial x_1}}$$

Using the chain rule in Leibniz's notation Swokowski (1979), we rewrite the above expression under the assumption that the objective function $F$ is non-linear. Let $y$ be the scalar-valued output of the model (for a classification task, $y$ is taken to be the output for the true class of the input sample). If $F$ is a non-linear function, then we can express $\frac{\partial F}{\partial x_i}$ as a function of $y$ for all $i \in \{1, 2, \ldots, d\}$. For example, if $F = \log(y)$, then $\frac{\partial F}{\partial x_i} = \frac{1}{y} \frac{\partial y}{\partial x_i}$, which is a function of $y$. As another example, if $F = (y - a)^2$, where $a$ is any constant, then $\frac{\partial F}{\partial x_i} = 2(y - a) \frac{\partial y}{\partial x_i}$, which is also a function of $y$. Applying the chain rule to each factor of the above expression gives

$$H_a^{(i,j)} = \frac{\frac{\partial (\frac{\partial F}{\partial x_1})}{\partial y} \times \frac{\partial y}{\partial x_j} \times \frac{\partial (\frac{\partial F}{\partial x_i})}{\partial y} \times \frac{\partial y}{\partial x_1}}{\frac{\partial (\frac{\partial F}{\partial x_1})}{\partial y} \times \frac{\partial y}{\partial x_1}}$$

$$H_a^{(i,j)} = \frac{\partial (\frac{\partial F}{\partial x_i})}{\partial y} \times \frac{\partial y}{\partial x_j} = \frac{\partial^2 F}{\partial x_i \partial x_j}$$

So, we can write $H_a^{(i,j)} = H^{(i,j)}$, which indicates that eq. 5 can be used to approximate the Hessian.

As the Hessian is a symmetric matrix, its first column and first row are identical, so $U = V$. Putting $U = V$ in eq. 5, we get

$$H_a = \frac{V V^T}{\frac{\partial^2 F}{\partial x_1^2}} = Z V^T \tag{6}$$

So, eq. 6 lets us approximate the full Hessian using only the first row of the true Hessian, where $Z = \frac{V}{\frac{\partial^2 F}{\partial x_1^2}}$.
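As an illustration of eq. 6 (a minimal sketch, not the authors' implementation), the following Python snippet builds $H_a = Z V^T$ from a first row and compares it with the exact Hessian of a toy loss. The toy loss $F(w) = (a \cdot w - b)^2$ and the function name `approx_hessian_from_first_row` are assumptions chosen for illustration; this loss is used because its exact Hessian $2aa^T$ has rank one, so eq. 6 reproduces it entry by entry.

```python
def approx_hessian_from_first_row(v):
    """Build H_a = V V^T / V[0] (eq. 6) from the first Hessian row V."""
    if v[0] == 0:
        raise ValueError("eq. 6 divides by the (1,1) Hessian entry, which must be non-zero")
    z = [vi / v[0] for vi in v]                   # Z = V / (d^2F/dx_1^2)
    return [[zi * vj for vj in v] for zi in z]    # outer product Z V^T


# Toy loss F(w) = (a . w - b)^2, whose exact Hessian is 2 a a^T (rank one).
a = [1.0, -2.0, 0.5]
true_hessian = [[2.0 * ai * aj for aj in a] for ai in a]

first_row = true_hessian[0]                       # the only row that is needed
h_a = approx_hessian_from_first_row(first_row)

# Largest entry-wise gap between the approximation and the exact Hessian.
max_err = max(abs(h_a[i][j] - true_hessian[i][j])
              for i in range(3) for j in range(3))
print(max_err)
```

By construction, the first row and first column of $H_a$ coincide with those of the true Hessian whenever the (1,1) entry is non-zero, and only the $d$ entries of $V$ have to be stored.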

5.2 Finding global Newton direction

In our proposed method FAGH, the server uses the global gradient $g^t$ and the first row of the global Hessian $v^t$ to find the global Newton direction, where $g^t$ and $v^t$ are obtained by aggregating the local gradients and the first rows of the local Hessians respectively. Taking inspiration from ADAM Kingma and Ba (2015), we use hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ to compute the exponential moving averages $M_1^t$ and $M_2^t$ of $g^t$ and $v^t$ respectively. These averages let the global model be updated without forgetting the knowledge gained in previous communication rounds, which is well suited to partial device participation, where not all clients may be available in a given communication round. As $M_1^t$ and $M_2^t$ are initialized as vectors of 0's, they are biased towards zero. So we use the bias-corrected estimates $\widehat{M_1}^t$ and $\widehat{M_2}^t$, as shown in algo. 1. We find the approximated global Hessian ($H_a$) by putting $V^t = \widehat{M_2}^t$ in eq. 6 and use the following regularized variant of the Newton-type update to avoid forming an indefinite global Hessian Battiti (1992).

$$w_t = w_{t-1} - \eta (H_a + \rho I)^{-1} G^t \tag{7}$$

where $\rho > 0$ is the regularization parameter, $I \in R^{d \times d}$ is the identity matrix and $G^t = \widehat{M_1}^t$. We use the Sherman-Morrison formula for matrix inversion to compute the global Newton direction directly, without forming and storing the full regularized Hessian $(H_a + \rho I)$ or its inverse, as shown below:

$$(H_a + \rho I)^{-1} G^t = (Z^t {V^t}^T + \rho I)^{-1} G^t$$

$$(H_a + \rho I)^{-1} G^t = \frac{G^t}{\rho} - \frac{\frac{Z^t {V^t}^T G^t}{\rho^2}}{1 + \frac{{V^t}^T Z^t}{\rho}}$$

where $Z^t = \frac{V^t}{V^t[0]}$ and $V^t[0] = \frac{\partial^2 F}{\partial x_1^2}$ is the first element of $V^t$.
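The server-side step described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the bias correction is assumed to follow the ADAM form $\widehat{M} = M / (1 - \beta^t)$ (algo. 1 is not reproduced in this section), and the function names, the toy vectors and the values of $\rho$ and $\eta$ are hypothetical. The final check multiplies the computed direction by $(Z^t {V^t}^T + \rho I)$ and confirms that it recovers $G^t$.

```python
def bias_corrected_ema(prev_m, value, beta, t):
    """One EMA step plus ADAM-style bias correction (assumed form of algo. 1)."""
    m = [beta * pm + (1.0 - beta) * x for pm, x in zip(prev_m, value)]
    m_hat = [mi / (1.0 - beta ** t) for mi in m]
    return m, m_hat


def newton_direction(v, g, rho):
    """Return (Z V^T + rho I)^{-1} g via Sherman-Morrison, using O(d) time and memory."""
    z = [vi / v[0] for vi in v]                     # Z = V / V[0]
    vt_g = sum(vi * gi for vi, gi in zip(v, g))     # scalar V^T G
    vt_z = sum(vi * zi for vi, zi in zip(v, z))     # scalar V^T Z
    scale = (vt_g / rho ** 2) / (1.0 + vt_z / rho)
    return [gi / rho - scale * zi for gi, zi in zip(g, z)]


# Hypothetical aggregated quantities for the first round (d = 3, t = 1).
beta1, beta2, rho, eta = 0.9, 0.99, 0.1, 0.5
g_t = [0.4, -0.2, 0.1]       # aggregated global gradient
v_t = [2.0, -1.0, 0.5]       # aggregated first row of the global Hessian
m1, G = bias_corrected_ema([0.0] * 3, g_t, beta1, t=1)
m2, V = bias_corrected_ema([0.0] * 3, v_t, beta2, t=1)

direction = newton_direction(V, G, rho)
w_prev = [1.0, 1.0, 1.0]
w_new = [w - eta * p for w, p in zip(w_prev, direction)]   # eq. 7

# Check: (Z V^T + rho I) applied to the direction should give back G.
Z = [vi / V[0] for vi in V]
vt_p = sum(vi * pi for vi, pi in zip(V, direction))
residual = max(abs(zi * vt_p + rho * pi - gi)
               for zi, pi, gi in zip(Z, direction, G))
print(residual)
```

Only vectors of length $d$ and two scalar inner products are needed, so no $d \times d$ matrix is ever formed. The Sherman-Morrison formula requires the denominator $1 + {V^t}^T Z^t / \rho$ to be non-zero; the sketch does not guard against that case.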

Figure 1: Comparisons of training loss, test loss and test accuracy on CIFAR10 image classification using LeNet5
Figure 2: Comparisons of training loss, test loss and test accuracy on FashionMNIST image classification using CNN
Figure 3: Comparisons of training loss, test loss and test accuracy on EMNIST image classification using MLR
Figure 4: Time comparisons of training loss, test loss and test accuracy on CIFAR10 image classification using LeNet5
Figure 5: Time comparisons of training loss, test loss and test accuracy on FashionMNIST image classification using CNN
Figure 6: Time comparisons of training loss, test loss and test accuracy on EMNIST image classification using MLR

5.3 Complexities

As FAGH requires each local client to compute the gradient and the first row of the true Hessian, the local time and space complexities of FAGH are both O(d + d), which is similar to existing first-order methods like SCAFFOLD Karimireddy et al. (2020), FedDC Gao et al. (2022), etc. Second-order methods like GIANT and DONE have a local time complexity of O(md + d), which is higher than that of FAGH (here $m > 1$ is the number of iterations used to approximate the Newton update). The local space complexities of both DONE and GIANT are O(d + d), which is similar to FAGH. FedNL and Basis Matters store the previous step's Hessian matrix in the local clients to approximate the current step's Hessian, which requires $O(d^2)$ local space. Compared to this, FAGH is highly beneficial for resource-constrained local clients. In FAGH, the server needs to retain the previous global gradient and the first row of the previous global Hessian to compute the exponential moving averages of the current gradient and the current Hessian's first row, which results in O(2d + 2d) space complexity on the server. As the server computes the global Newton direction directly with the help of the Sherman-Morrison formula for matrix inversion, the overall time complexity of the server is O(d). The server's space complexity of FAGH is only three times more than that of FedAvg, which is far lower than FedSSO's server space complexity of $O(d^2)$. Sustaining O(4d) space complexity may not be a big issue for the host server. One limitation of FAGH is that, like SCAFFOLD, it incurs O(2d) client-to-server communication cost, which can be handled by applying proper compression techniques before communication to the server. The applicability of compression in FL can be seen from FedNL and Basis Matters, where the local Hessian is compressed before it is sent to the server.
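To make the gap between O(4d) and $O(d^2)$ storage concrete, the following back-of-the-envelope sketch converts both into bytes. It assumes 32-bit floats (an assumption, since the storage precision is not stated) and uses the parameter count of the FashionMNIST CNN reported in Section 6; the function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage precision


def server_memory_bytes(d):
    """Return (bytes for 4d floats, bytes for a dense d x d matrix)."""
    vectors_only = 4 * d * BYTES_PER_FLOAT32      # O(2d + 2d) from the text
    dense_hessian = d * d * BYTES_PER_FLOAT32     # O(d^2), a fully stored Hessian
    return vectors_only, dense_hessian


d = 1475338  # trainable parameters of the FashionMNIST CNN (Section 6)
vectors_only, dense_hessian = server_memory_bytes(d)
print(f"4d floats     : {vectors_only / 1e6:.1f} MB")
print(f"d x d Hessian : {dense_hessian / 1e12:.1f} TB")
```

Under these assumptions the four length-$d$ vectors take roughly 23.6 MB, whereas a dense $d \times d$ matrix would take roughly 8.7 TB.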

5.4 Applicability of FAGH

As the Hessian approximation in section 5.1 is tailored to a twice-differentiable and non-linear loss function $F$, FAGH is applicable only to optimization problems with such loss functions. For example, optimizing neural networks or multinomial logistic regression (MLR) with the cross-entropy loss involves a twice-differentiable and non-linear loss function. To assess the applicability of FAGH, we conducted extensive experiments on federated image classification tasks using machine learning and deep learning models with the cross-entropy loss, and observed promising outcomes from FAGH.

6 Experimental setup

To validate our proposed method, we conduct extensive experiments on heterogeneously partitioned CIFAR10 Krizhevsky et al. (2009), FashionMNIST Xiao et al. (2017) and EMNIST-letters Cohen et al. (2017) datasets. CIFAR10 comprises color images ($3 \times 32 \times 32$) of 10 classes, with 60,000 samples in total (50,000 training samples and 10,000 test samples). FashionMNIST comprises grayscale images ($1 \times 28 \times 28$) of 10 classes, with 70,000 samples in total (60,000 training samples and 10,000 test samples). EMNIST-letters comprises grayscale images ($1 \times 28 \times 28$ pixels) of handwritten uppercase and lowercase letters, divided into 26 classes. EMNIST-letters includes a total of 145,600 samples, with 124,800 for training and 20,800 for testing. To create heterogeneous data partitions for the CIFAR10 and FashionMNIST datasets, we use the same Dirichlet-distribution-based heterogeneous and unbalanced partition strategy as in the papers of Yurochkin et al. and Wang et al. We simulate $P_i \sim Dir_K(0.2)$ and obtain a heterogeneous partition by allocating a $P_{(i,j)}$ proportion of the samples of the $i^{th}$ class to the $j^{th}$ client. As we use a very small value of the Dirichlet distribution's concentration parameter (0.2), each client may not get samples of all the classes, which indicates a high degree of data heterogeneity across the clients. For our experiments, we use K = 200 clients.
For the EMNIST dataset, we use a partitioning strategy similar to that of McMahan et al., where the data is sorted by class label and then distributed. We create 400 shards of size 312 and assign 2 shards to each of the 200 clients.
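The two partitioning strategies described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of Yurochkin et al., Wang et al. or McMahan et al.: the function names are hypothetical, the Dirichlet sample is drawn from normalised Gamma variates using only the standard library, and the rounding of per-client shares is one possible choice. The toy data (2,000 samples, 10 classes, 20 clients) is much smaller than the setup in the text.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices so that class i sends a P_(i,j) share to client j."""
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [n for n, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Dirichlet(alpha, ..., alpha) sample via normalised Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cum = 0, 0.0
        for j, gamma in enumerate(gammas):
            cum += gamma / total
            end = len(idx) if j == num_clients - 1 else round(cum * len(idx))
            clients[j].extend(idx[start:end])
            start = end
    return clients


def shard_partition(labels, num_clients, shards_per_client, rng):
    """Sort by label, cut into equal shards, and deal shards to clients."""
    order = sorted(range(len(labels)), key=lambda n: labels[n])
    num_shards = num_clients * shards_per_client
    size = len(order) // num_shards
    shards = [order[s * size:(s + 1) * size] for s in range(num_shards)]
    rng.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(num_clients)]


# Toy data: 2,000 samples spread evenly over 10 classes.
labels = [n % 10 for n in range(2000)]
rng = random.Random(0)
dir_clients = dirichlet_partition(labels, num_clients=20, alpha=0.2, rng=rng)
shard_clients = shard_partition(labels, num_clients=20, shards_per_client=2, rng=rng)
print([len(c) for c in dir_clients])
print([sorted({labels[n] for n in c}) for c in shard_clients[:3]])
```

With a small concentration parameter the Dirichlet split gives clients very unequal sample counts, while the shard split gives every client the same number of samples drawn from at most two classes in this toy setting.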

For CIFAR10 image classification, we use the LeNet5 model LeCun and others (2015). For FashionMNIST, we use a custom convolutional neural network (CNN) model (1,475,338 trainable parameters in total) with two convolutional layers and three fully connected layers. After each convolutional layer, we use batch normalization, ReLU activation and max-pooling. After the first fully connected layer, we use a dropout of 0.25. For EMNIST-letters, we use a multinomial logistic regression (MLR) model. For all the federated image classification tasks, we use the cross-entropy loss function.

We compare our algorithm with existing state-of-the-art federated learning algorithms such as SCAFFOLD, FedGA, FedExP, GIANT and DONE. To account for partial device participation in FL, 40% of the clients participate in each communication round. We run extensive experiments with a wide set of hyper-parameters for all the methods and select the best performing model for each method by considering minimum training and test losses and maximum test accuracy. We use FedGA $\beta$, FAGH $\rho$ and FedExP $\epsilon$ $\in \{1, 0.5, 0.1, 0.01, 0.001\}$, learning rate $\eta \in \{1, 0.5, 0.1, 0.01, 0.001, 0.0001\}$, FAGH $\beta_1 = 0.9$ and $\beta_2 = 0.99$, 10 Richardson iterations for DONE, 10 CG iterations for GIANT and $\alpha_{DONE} \in \{0.01, 0.05\}$. For FedExP, we use the SGD optimizer with momentum (0.9) to find local updates. We use a total of T = 100 communication rounds. We implement all the methods using a Tesla V100 GPU and PyTorch 1.12.1+cu102. We use seed = 0 and batch size = 512. For each dataset, we use the same initialization and the same settings for all the existing and proposed methods.
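The participation and search settings above can be written down as a small configuration sketch. It assumes that participating clients are drawn uniformly without replacement in each round (the selection rule is not stated in the text) and that the FAGH search is a full grid over $\rho$ and $\eta$; all variable names are hypothetical.

```python
import itertools
import random

NUM_CLIENTS = 200
PARTICIPATION = 0.4
ROUNDS = 100

rng = random.Random(0)  # seed = 0, as in the setup above
clients_per_round = int(NUM_CLIENTS * PARTICIPATION)
# Assumed rule: uniform sampling without replacement in every round.
schedule = [rng.sample(range(NUM_CLIENTS), clients_per_round)
            for _ in range(ROUNDS)]

# FAGH search grid from the text: rho x eta, with beta1 and beta2 held fixed.
rhos = [1, 0.5, 0.1, 0.01, 0.001]
etas = [1, 0.5, 0.1, 0.01, 0.001, 0.0001]
fagh_grid = [{"rho": r, "eta": e, "beta1": 0.9, "beta2": 0.99}
             for r, e in itertools.product(rhos, etas)]
print(clients_per_round, len(fagh_grid))
```

Under these assumptions, each round involves 80 of the 200 clients and the FAGH grid contains 30 configurations.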

We compare our algorithm with DONE and GIANT on the MLR-based classification task. We also tried to compare our algorithm with DONE and GIANT on the CNN-based classification tasks, but we did not find suitable hyper-parameters for DONE and GIANT for these CNN-based implementations. This may be due to their assumption of a strongly convex loss function.

| Method | 30% | 35% | 40% | 45% |
|---|---|---|---|---|
| FAGH | 18 | 29 | 43 | 78 |
| FedGA | 23 | 54 | 96 | … |
| SCAFFOLD | 38 | 66 | … | … |
| FedExP | 24 | 48 | 97 | … |

Table 1: Number of communication rounds needed to reach different test accuracies on CIFAR10 with the LeNet5 model. Here, "…" means that the algorithm could not reach the accuracy within T = 100 FL iterations.

| Method | 60% | 70% | 80% | 85% |
|---|---|---|---|---|
| FAGH | 7 | 14 | 36 | 71 |
| FedGA | 7 | 12 | 46 | … |
| SCAFFOLD | 13 | 35 | … | … |
| FedExP | 8 | 18 | … | … |

Table 2: Number of communication rounds needed to reach different test accuracies on FashionMNIST with the CNN model. Here, "…" means that the algorithm could not reach the accuracy within T = 100 FL iterations.

| Method | 40% | 50% | 60% | 65% |
|---|---|---|---|---|
| FAGH | 6 | 11 | 25 | 58 |
| FedGA | 11 | 21 | 46 | … |
| SCAFFOLD | 13 | 23 | 61 | … |
| FedExP | 11 | 19 | 55 | … |
| DONE | 7 | 12 | 34 | 96 |
| GIANT | 8 | 12 | 33 | 79 |

Table 3: Number of communication rounds needed to reach different test accuracies on EMNIST with the MLR model. Here, "…" means that the algorithm could not reach the accuracy within T = 100 FL iterations.
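The entries of Tables 1 to 3 can be read as the output of a simple rule applied to each method's test-accuracy curve. The sketch below assumes that rounds are counted from 1 and that a target counts as reached at the first round whose test accuracy is at least the target; both conventions are assumptions, and the accuracy history is made up for illustration rather than taken from the experiments.

```python
def rounds_to_reach(accuracies, target):
    """Return the first 1-based round whose test accuracy is >= target, else None."""
    for round_idx, acc in enumerate(accuracies, start=1):
        if acc >= target:
            return round_idx
    return None


# Hypothetical per-round test accuracies (percent), not taken from the paper.
history = [12.0, 25.5, 31.0, 29.8, 36.2, 41.0]
targets = [30, 35, 40, 45]
row = [rounds_to_reach(history, t) for t in targets]
print(["…" if r is None else r for r in row])
```

A target that is never reached within the recorded rounds maps to `None`, which corresponds to the "…" entries in the tables.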

6.1 Results

Our experimental results are shown in figs. 1 to 6 and tables 1 to 3. From the figures, it may be observed that FAGH decreases the training and test losses in less time and in fewer communication rounds than SCAFFOLD, FedGA, FedExP, GIANT and DONE. It may also be observed that FAGH achieves better test accuracy at different time steps and communication rounds than SCAFFOLD, FedGA, FedExP, GIANT and DONE. From the tables, it may be observed that FAGH takes comparatively fewer communication rounds to achieve the different targeted test accuracies. As we use the same initialization and the same settings for all the methods, we may claim that FAGH can provide faster FL training while achieving a certain precision of the global model performance in heterogeneous FL settings with partial client participation. FAGH is easy to implement, as it has only two active tuning hyper-parameters: the Hessian regularization parameter ($\rho$) and the learning rate ($\eta$). From our experiments, we noticed that, as in ADAM Kingma and Ba (2015), the exponential decay rates for the moment estimates of FAGH ($\beta_1$ and $\beta_2$) can be standardized to 0.9 and 0.99 respectively.

7 Conclusions

We proposed a new Newton optimization-based FL training method, namely FAGH, which uses the approximated global Hessian to accelerate the convergence of global model training in FL, thereby addressing the heavy communication overhead caused by the large number of communication rounds needed to train the global model to convergence. FAGH is beneficial for practical implementation in terms of both local and server space complexities in comparison to existing Newton-based FL training algorithms. Experimental results demonstrate that FAGH outperforms several state-of-the-art FL training methods, including SCAFFOLD, FedGA, FedExP, GIANT, and DONE, in terms of the number of communication rounds and the time required to train the global model in FL to achieve the pre-specified performance objectives. In the future, we plan to investigate how to identify a set of local clients for participating in training the global model in an adaptive and privacy-preserving manner, e.g., by leveraging learning vector quantization Qin and Suganthan (2005) and graph matching Gong et al. (2016) techniques, to further improve the convergence of the global model while keeping its performance in other aspects.

References

  • Agarwal et al. [2017] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res., 18:116:1–116:40, 2017.
  • Battiti [1992] Roberto Battiti. First and second-order methods for learning: Between steepest descent and newton’s method. Neural Comput., 4(2):141–166, 1992.
  • Bischoff et al. [2021] Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, and Sebastian U Stich. On second-order optimization methods for federated learning. arXiv preprint arXiv:2109.02388, 2021.
  • Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017.
  • Dandi et al. [2022] Yatin Dandi, Luis Barba, and Martin Jaggi. Implicit gradient alignment in distributed and federated learning. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 6454–6462. AAAI Press, 2022.
  • Derezinski and Mahoney [2019] Michal Derezinski and Michael W. Mahoney. Distributed estimation of the inverse hessian by determinantal averaging. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11401–11411, 2019.
  • Dinh et al. [2020] Canh T. Dinh, Nguyen Hoang Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Dinh et al. [2022] Canh T. Dinh, Nguyen H. Tran, Tuan Dung Nguyen, Wei Bao, Amir Rezaei Balef, Bing Bing Zhou, and Albert Y. Zomaya. DONE: distributed approximate newton-type method for federated edge learning. IEEE Trans. Parallel Distributed Syst., 33(11):2648–2660, 2022.
  • Gao et al. [2022] Liang Gao, Huazhu Fu, Li Li, Yingwen Chen, Ming Xu, and Cheng-Zhong Xu. Feddc: Federated learning with non-iid data via local drift decoupling and correction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10102–10111. IEEE, 2022.
  • Gong et al. [2016] Maoguo Gong, Yue Wu, Qing Cai, Wenping Ma, A. K. Qin, Zhenkun Wang, and Licheng Jiao. Discrete particle swarm optimization for high-order graph matching. Information Sciences, 328:158–171, 2016.
  • Jhunjhunwala et al. [2023] Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fedexp: Speeding up federated averaging via extrapolation. CoRR, abs/2301.09604, 2023.
  • Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 2020.
  • Ketkar and Ketkar [2017] Nikhil Ketkar and Nikhil Ketkar. Stochastic gradient descent. Deep learning with Python: A hands-on introduction, pages 113–132, 2017.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun and others [2015] Yann LeCun et al. Lenet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20(5):14, 2015.
  • Li et al. [2020a] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Feddane: A federated newton-type method. CoRR, abs/2001.01920, 2020.
  • Li et al. [2020b] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020.
  • Li et al. [2020c] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
  • Li et al. [2021] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 10713–10722, 2021.
  • Liu and Nocedal [1989] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
  • Ma et al. [2022] Xin Ma, Renyi Bao, Jinpeng Jiang, Yang Liu, Arthur Jiang, Jun Yan, Xin Liu, and Zhisong Pan. Fedsso: A federated server-side second-order optimization algorithm. CoRR, abs/2206.09576, 2022.
  • Martens and Grosse [2015] James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2408–2417. JMLR.org, 2015.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54, pages 1273–1282. PMLR, 2017.
  • Nazareth [2009] John L Nazareth. Conjugate gradient method. Wiley Interdisciplinary Reviews: Computational Statistics, 1(3):348–353, 2009.
  • Qian et al. [2022] Xun Qian, Rustem Islamov, Mher Safaryan, and Peter Richtárik. Basis matters: Better communication-efficient second order methods for federated learning. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 680–720. PMLR, 2022.
  • Qin and Suganthan [2005] A. K. Qin and P. N. Suganthan. Initialization insensitive LVQ algorithm based on cost-function adaptation. Pattern Recognition, 38(5):773–776, 2005.
  • Safaryan et al. [2022] Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik. Fednl: Making newton-type methods applicable to federated learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18959–19010. PMLR, 2022.
  • Shamir et al. [2014] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32, pages 1000–1008, 2014.
  • Swokowski [1979] Earl William Swokowski. Calculus with analytic geometry. Taylor & Francis, 1979.
  • Tan et al. [2021] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. CoRR, abs/2103.00710, 2021.
  • Tankaria et al. [2021] Hardik Tankaria, Dinesh Singh, and Makoto Yamada. Nys-curve: Nyström-approximated curvature for stochastic optimization. CoRR, abs/2110.08577, 2021.
  • Vuchkov [2022] Radoslav G Vuchkov. Hessian Approximations for Large-Scale Inverse Problems Governed By Partial Differential Equations. PhD thesis, UC Merced, 2022.
  • Wang et al. [2018] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: globally improved approximate newton method for distributed optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2338–2348, 2018.
  • Wang et al. [2020a] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris S. Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • Wang et al. [2020b] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • Yurochkin et al. [2019] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan H. Greenewald, Trong Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 7252–7261. PMLR, 2019.