FAGH: Accelerating Federated Learning with Approximated Global Hessian

Mrinmay Sen$^{1,2}$, A. K. Qin$^{1}$, Krishna Mohan C$^{2}$

$^{1}$ Dept. of Computing Technologies, Swinburne University of Technology, Hawthorn, VIC, Australia

$^{2}$ Dept. of Artificial Intelligence, Indian Institute of Technology Hyderabad, Hyderabad, India

Abstract

In federated learning (FL), the slow convergence of global model training causes significant communication overhead: a large number of communication rounds are required to reach convergence. One potential solution is to train with a Newton-based optimization method, known for its quadratic convergence rate. However, existing Newton-based FL training methods suffer from either memory inefficiency or high computational costs for local clients or the server. To address this issue, we propose FL with approximated global Hessian (FAGH), a method to accelerate FL training. FAGH leverages the first moment of the approximated global Hessian and the first moment of the global gradient to train the global model. By harnessing the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, reducing the number of communication rounds and thus shortening the training time. Experimental results verify FAGH’s effectiveness in decreasing the number of communication rounds and the time required to achieve pre-specified objectives of global model performance in terms of training and test losses as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods.

1 Introduction

In centralized learning, all data is collected in one place and used to train a machine learning model. Despite its high performance, centralized learning poses risks of privacy leakage and communication overhead when collecting data from different sources or clients. These challenges motivate the transition to federated learning from centralized learning. In the most popular and baseline algorithm of federated learning, FedAvg McMahan et al. (2017), locally trained models are collected on the server instead of raw data, and the server aggregates all local models to find the global model, which is then sent back to all clients for further training. Sending local models instead of raw data to the server helps overcome data transfer challenges. One communication round of FedAvg involves two communications: sending the global model from the server to all available clients and sharing locally trained models with the server, resulting in a communication cost of O(2d). Since FedAvg employs a first-order stochastic gradient descent optimizer Ketkar and Ketkar (2017) to update local models, it is computationally efficient, with a local time complexity of O(d), where d represents the number of model parameters. FedAvg performs well when data are homogeneously distributed across all clients Li et al. (2020c), a scenario which is rarely encountered in real-life applications. In cases of heterogeneous data distribution, FedAvg suffers from objective inconsistency Karimireddy et al. (2020); Li et al. (2020b, c); Tan et al. (2021); Wang et al. (2020b). This inconsistency occurs when the global model converges to a stationary point that may not be the optimum of the global objective function, resulting in slow training of the global model, increased number of communication rounds, and time required to achieve a certain performance level. 
To accelerate FL training in scenarios of heterogeneous data distribution, several modifications have been proposed for FedAvg, including FedProx Li et al. (2020b), FedNova Wang et al. (2020b), SCAFFOLD Karimireddy et al. (2020), MOON Li et al. (2021), FedDC Gao et al. (2022), pFedMe Dinh et al. (2020), FedGA Dandi et al. (2022), FedExP Jhunjhunwala et al. (2023), among others. Although these modifications generally outperform FedAvg, the learning of the global model remains slower when aiming for a targeted performance, as these methods primarily utilize first-order gradient information for optimizing model parameters. Additionally, these methods are highly sensitive to hyperparameter choices.

To further accelerate FL training, researchers have shifted their attention from first-order optimization to the second-order Newton method of optimization due to its higher convergence rate compared to first-order methods Agarwal et al. (2017); Tankaria et al. (2021). Although the Newton method outperforms first-order optimization in terms of convergence, calculating and storing the Hessian and its inverse is challenging in large-scale settings (with time complexities of $O(d^{2})$ and $O(d^{3})$ respectively for calculating the Hessian and its inverse, and a space complexity of $O(d^{2})$ for storing them). To address these challenges, researchers have focused on approximating the Hessian Agarwal et al. (2017); Liu and Nocedal (1989); Martens and Grosse (2015); Tankaria et al. (2021); Nazareth (2009); Vuchkov (2022) instead of using the true Hessian while optimizing model parameters. In federated learning, another issue arises when local models are updated using the Newton method. Since the Newton method uses the Hessian inverse to update model parameters, averaging all locally trained models to find the global model is not feasible Derezinski and Mahoney (2019). State-of-the-art solutions to these problems fall into three categories. The first approach computes the local Hessian matrix and sends it to the server in a compressed form, where the server aggregates all local Hessian information to determine the global Newton direction Safaryan et al. (2022); Qian et al. (2022). The second approach finds the local Newton direction with global gradient information Wang et al. (2018); Dinh et al. (2022). The third approach uses the Quasi-Newton method of optimization Ma et al. (2022), where local first-order information is collected and aggregated on the server to find the global Newton direction.
Existing Newton method-based federated optimizations Bischoff et al. (2021) include DANE Shamir et al. (2014), GIANT Wang et al. (2018), FedDANE Li et al. (2020a), FedSSO Ma et al. (2022), FedNL Safaryan et al. (2022), Basis Matters Qian et al. (2022), DONE Dinh et al. (2022), etc. These methods are either computationally expensive for local clients because they compute the Hessian matrix, memory inefficient because they store it, or require four separate communications in each FL communication round.

To expedite FL training, this paper introduces FAGH, a Newton method-based federated learning algorithm that eliminates the need for four separate communications, as seen in approaches like DONE or GIANT. FAGH approximates the global Newton direction on the server without computing and storing the full Hessian matrix. In FAGH, the server collects gradients and the first row of the true Hessian from each local client. Utilizing the first moment of the average gradient and the first moment of the average first row of the true Hessian across all clients, the server determines the global Newton direction and updates the global model. By leveraging the approximated global Hessian curvature, FAGH accelerates the convergence of global model training, resulting in a reduced number of communication rounds and shorter training times. Experimental results confirm FAGH’s effectiveness in reducing the required number of communication rounds to achieve predefined objectives for global model performance, including training and test losses, as well as test accuracy. Notably, FAGH outperforms several state-of-the-art FL training methods. Since FAGH utilizes only the first row of the true Hessian when determining the Newton direction, it can significantly reduce local time and space complexities compared to existing second-order FL algorithms.

The main contributions of FAGH are as follows.

• In FAGH, each client computes the gradient and the first row of the Hessian of its local loss function and sends both to the server. The server computes the first moments of the global gradient and of the first row of the global Hessian.

• FAGH computes the global Newton direction directly from these first moments, without computing or storing the full global Hessian matrix on the server.

• Using this directly computed global Newton direction leads to faster training in federated learning with linear time and space complexities.

The rest of the paper is organized as follows. Section 2 reviews related work, Section 3 presents the problem formulation, Section 4 gives the preliminaries, Section 5 describes our proposed method, Section 6 presents the experimental setup and results, and Section 7 concludes the paper.

2 Related Work

The related works can be classified into two categories: first-order based and second-order based FL approaches.

Existing first-order approaches include FedProx, FedNova, SCAFFOLD, MOON, FedDC, pFedMe, FedGA, FedExP, etc. In FedProx, the local objective (loss) function is modified by adding a proximal term weighted by $\mu$, which helps to control the direction of the local gradient. Instead of using a simple or weighted average on the server, FedNova uses normalized averaging to obtain the global model. To control the drastic fluctuation of local gradients caused by data heterogeneity, SCAFFOLD uses variance reduction while updating the local model. To correct the local training, MOON conducts model-level contrastive learning, which uses the similarity between model representations. To control the update of the local model, FedDC uses an auxiliary local drift variable. pFedMe uses a loss function regularized with Moreau envelopes. FedGA finds the displacement of the local gradient with respect to the global gradient and uses it when initializing local models. To speed up FL training, FedExP adaptively finds the server step size (learning rate) using the extrapolation mechanism of the Projection Onto Convex Sets (POCS) algorithm.

Existing second-order approaches include DANE, GIANT, FedDANE, FedSSO, FedNL, Basis Matters, DONE, etc. FedSSO applies a server-based Quasi-Newton method to the global gradient (the average gradient across all clients) to find the global Newton direction. FedSSO has the same local time complexity as FedAvg (O(d)), but it stores the full Hessian matrix on the server, and the resulting server space complexity of $O(d^{2})$ may not be practical in large-scale settings. DANE, GIANT and FedDANE use the conjugate gradient method to approximate the Hessian, and DONE uses Richardson iteration. DANE, GIANT, FedDANE and DONE require the global gradient communicated by the server while finding the local Newton direction, which increases the total time for one FL iteration. One FL iteration of these methods consists of four separate communications: (1) sending the initial model from the server to the clients, (2) sending local gradients from the clients to the server, (3) sending the average gradient from the server to the clients, and (4) sending locally updated models from the clients to the server. Using the conjugate gradient method or Richardson iteration involves a time complexity of O(md+d), where m is the number of conjugate gradient or Richardson iterations, which increases the local time complexity by roughly a factor of m compared with FedAvg. FedNL and Basis Matters compute local Hessian information and send it to the server in a compressed form. Both store the previous step’s Hessian matrix to approximate the current step’s Hessian. Storing, computing and compressing the local Hessian places additional computational and memory load on the local clients.

3 Problem formulation

Let $C=\{C_{1},C_{2},..,C_{K}\}$ be the set of participating clients in federated learning, where $K$ is the number of clients, and let $D_{i}$ be the dataset owned by client $C_{i}$. The goal of federated learning is to find the optimum of the global objective function $F(w)$ over $w\in R^{d}$, as given in eq. 1.

$\min_{w}F(w)=\sum^{K}_{i=1}p_{i}F_{i}(w;D_{i})$ (1)

where $w$ denotes the model parameters, $F_{i}(w)=\frac{1}{|D_{i}|}\sum_{\xi_{j}\in D_{i}}f_{j}(w;\xi_{j})$ is the average loss of client $C_{i}$ computed on dataset $D_{i}$, $f_{j}$ is the local loss for sample $\xi_{j}\in D_{i}$, and the client weights satisfy $\sum_{i=1}^{K}p_{i}=1$.
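As a small illustration of eq. 1, the sketch below evaluates a weighted sum of local losses in plain Python. The function names and the size-proportional choice of $p_{i}$ are assumptions made for this example; the text only requires that the weights sum to one.

```python
from typing import Callable, List, Sequence


def global_objective(
    w: Sequence[float],
    local_losses: List[Callable[[Sequence[float]], float]],
    weights: Sequence[float],
) -> float:
    """Return F(w) = sum_i p_i * F_i(w) for client weights p_i that sum to one."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("client weights p_i must sum to 1")
    return sum(p * f(w) for p, f in zip(weights, local_losses))


def size_proportional_weights(dataset_sizes: Sequence[int]) -> List[float]:
    """Hypothetical choice p_i = |D_i| / sum_j |D_j| (an assumption for this sketch)."""
    total = sum(dataset_sizes)
    return [n / total for n in dataset_sizes]


# Toy usage: two clients with quadratic local losses on a one-parameter model.
toy_losses = [lambda w: (w[0] - 1.0) ** 2, lambda w: (w[0] + 1.0) ** 2]
toy_weights = size_proportional_weights([30, 10])
toy_value = global_objective([0.0], toy_losses, toy_weights)
```

With heterogeneous clients the minimizers of the individual $F_{i}$ differ (here $+1$ and $-1$), which is the situation the rest of the paper targets.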

4 Preliminaries

4.1 Newton method of optimization

The Newton method of optimization is similar to first-order stochastic gradient descent (SGD) Ketkar and Ketkar (2017); the only difference lies in how the update direction is found. In SGD, the gradient of the objective function is scaled by a learning rate $\eta$ to find the update direction, as shown in eq. 2. In the Newton method (eq. 3), the update direction is found by scaling the gradient with the inverse of the true Hessian ($H$), which incorporates curvature information while searching for the optimum of the objective function. Using second-order Hessian curvature while optimizing model parameters leads to a quadratic convergence rate Agarwal et al. (2017), which motivates us to use the Newton method in FL to accelerate global model training.

$w_{t}=w_{t-1}-\eta g_{t-1}$ (2)

$w_{t}=w_{t-1}-H^{-1}g_{t-1}$ (3)
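To make the difference between eq. 2 and eq. 3 concrete, the following sketch applies one step of each update to a two-dimensional quadratic, where the Hessian is constant and the Newton step lands on the minimizer. The quadratic, its coefficients and all function names are assumptions chosen for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def inverse_2x2(a: Matrix) -> Matrix:
    """Closed-form inverse of a 2x2 matrix (sufficient for this toy problem)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def quadratic_gradient(a: Matrix, b: Vector, w: Vector) -> Vector:
    """Gradient of F(w) = 0.5 * w^T A w - b^T w, which is A w - b."""
    return [aw_i - b_i for aw_i, b_i in zip(matvec(a, w), b)]


def sgd_step(w: Vector, g: Vector, eta: float) -> Vector:
    """Eq. 2: scale the gradient by the learning rate."""
    return [w_i - eta * g_i for w_i, g_i in zip(w, g)]


def newton_step(w: Vector, g: Vector, hessian: Matrix) -> Vector:
    """Eq. 3: scale the gradient by the inverse Hessian."""
    direction = matvec(inverse_2x2(hessian), g)
    return [w_i - d_i for w_i, d_i in zip(w, direction)]


A = [[4.0, 1.0], [1.0, 3.0]]   # Hessian of the toy quadratic (positive definite)
b = [1.0, 2.0]
w0 = [0.0, 0.0]
g0 = quadratic_gradient(A, b, w0)

w_sgd = sgd_step(w0, g0, eta=0.1)
w_newton = newton_step(w0, g0, A)

residual_sgd = max(abs(x) for x in quadratic_gradient(A, b, w_sgd))
residual_newton = max(abs(x) for x in quadratic_gradient(A, b, w_newton))
```

After one step the gradient at `w_newton` vanishes up to rounding, while the gradient at `w_sgd` does not. For non-quadratic losses the Newton step is no longer exact, but it still uses curvature that the SGD step ignores.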

4.2 Sherman Morrison formula of matrix inversion

The inverse of the matrix $(B+ZV^{T})\in R^{d\times d}$ can be calculated using the Sherman-Morrison formula of matrix inversion, as shown in eq. 4.

$(B+ZV^{T})^{-1}=B^{-1}-\frac{B^{-1}ZV^{T}B^{-1}}{1+V^{T}B^{-1}Z}$ (4)

where $B\in R^{d\times d}$ is an invertible square matrix and $Z,V\in R^{d\times 1}$ are column vectors.
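Eq. 4 can be checked numerically on a small example. The sketch below builds the right-hand side of the formula from a known $B^{-1}$ and verifies that multiplying it by $B+ZV^{T}$ gives the identity. The diagonal $B$, the vectors and the helper names are arbitrary choices for this illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sherman_morrison_inverse(b_inv: Matrix, z: Vector, v: Vector) -> Matrix:
    """Return (B + z v^T)^{-1} from B^{-1}, following eq. 4."""
    d = len(z)
    b_inv_z = [sum(b_inv[i][k] * z[k] for k in range(d)) for i in range(d)]   # B^{-1} z
    vt_b_inv = [sum(v[k] * b_inv[k][j] for k in range(d)) for j in range(d)]  # v^T B^{-1}
    denom = 1.0 + sum(v[k] * b_inv_z[k] for k in range(d))                    # 1 + v^T B^{-1} z
    if denom == 0.0:
        raise ZeroDivisionError("B + z v^T is singular")
    return [[b_inv[i][j] - b_inv_z[i] * vt_b_inv[j] / denom for j in range(d)]
            for i in range(d)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product, used only to verify the result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


# Toy data: B is diagonal so that its inverse is known in closed form.
diag = [2.0, 4.0, 5.0]
B = [[diag[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
B_inv = [[1.0 / diag[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
Z = [1.0, 2.0, 3.0]
V = [0.5, -1.0, 2.0]

updated = [[B[i][j] + Z[i] * V[j] for j in range(3)] for i in range(3)]  # B + Z V^T
product = matmul(updated, sherman_morrison_inverse(B_inv, Z, V))
max_deviation = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                    for i in range(3) for j in range(3))
```

The dense matrices here exist only for verification. When $B$ is a multiple of the identity, the same formula can be applied to a vector using inner products alone, which is how Section 5.2 uses it.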

Algorithm 1 FAGH

Input: $T$: number of global epochs; $w_{0}$: initial global model; $\eta$: learning rate; $\rho$: Hessian regularization parameter; $\{\beta_{1},\beta_{2}\}\in[0,1)$: exponential decay rates for the moment estimates; $M_{1}^{0}\leftarrow 0$, $M_{2}^{0}\leftarrow 0$: initial moment vectors, initialized with zero

1: for $t=1$ to $T$ do
2:   Randomly pick a subset of clients $\overline{C}\subseteq C$
3:   Server sends global model $w_{t-1}$ to all the available clients $\overline{C}$
4:   In clients:
5:   for $i=1$ to $|\overline{C}|$ do
6:     Client $C_{i}$ finds local gradient $g_{i}^{t}=\frac{\partial F_{i}(w_{t-1},D_{i})}{\partial w_{t-1}}$ and first row of the true Hessian $v_{i}^{t}=\frac{\partial g_{i}^{t}[0]}{\partial w_{t-1}}$, where $g_{i}^{t}[0]$ is the first element of $g_{i}^{t}$
7:   end for
8:   In server:
9:   Collect $g_{i}^{t}$ and $v_{i}^{t}$ from all the clients
10:   Aggregate all $g_{i}^{t}$ to find the global (average) gradient $g^{t}=\sum p_{i}g_{i}^{t}$ and aggregate all $v_{i}^{t}$ to find the first row of the global (average) Hessian $v^{t}=\sum p_{i}v_{i}^{t}$ across all the clients
11:   Find $M_{1}^{t}=\beta_{1}M_{1}^{t-1}+(1-\beta_{1})g^{t}$
12:   Find $M_{2}^{t}=\beta_{2}M_{2}^{t-1}+(1-\beta_{2})v^{t}$
13:   Find $\widehat{M_{1}}^{t}=\frac{M_{1}^{t}}{1-\beta_{1}^{t}}$ and $\widehat{M_{2}}^{t}=\frac{M_{2}^{t}}{1-\beta_{2}^{t}}$, where $\beta_{j}^{t}$ is $\beta_{j}$ to the power $t$
14:   $G^{t}\leftarrow\widehat{M_{1}}^{t}\in R^{d\times 1}$
15:   $V^{t}\leftarrow\widehat{M_{2}}^{t}\in R^{d\times 1}$
16:   Find $Z^{t}=\frac{V^{t}}{V^{t}[0]}\in R^{d\times 1}$, where $V^{t}[0]$ is the first element of $V^{t}$
17:   Calculate update $(H_{a}+\rho I)^{-1}G^{t}=\frac{G^{t}}{\rho}-\frac{\frac{Z^{t}{V^{t}}^{T}G^{t}}{\rho^{2}}}{1+\frac{{V^{t}}^{T}Z^{t}}{\rho}}$
18:   Find $w_{t}=w_{t-1}-\eta(H_{a}+\rho I)^{-1}G^{t}$
19: end for
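The server-side part of Algorithm 1 (steps 9 to 18) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fagh_server_step`, the list-based vectors, the toy inputs and the guard against a zero $V^{t}[0]$ are all choices made for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_average(vectors: Sequence[Sequence[float]], p: Sequence[float]) -> Vector:
    """Step 10: sum_i p_i * vector_i, computed coordinate-wise."""
    return [sum(p_i * vec[j] for p_i, vec in zip(p, vectors)) for j in range(len(vectors[0]))]


def fagh_server_step(
    w: Vector, m1: Vector, m2: Vector, t: int,
    client_grads: Sequence[Vector], client_rows: Sequence[Vector], p: Sequence[float],
    eta: float, rho: float, beta1: float = 0.9, beta2: float = 0.99,
) -> Tuple[Vector, Vector, Vector]:
    """One server update; returns (w_t, M1_t, M2_t). The round index t starts at 1."""
    g = weighted_average(client_grads, p)            # global gradient g^t
    v = weighted_average(client_rows, p)             # first row of the global Hessian v^t
    m1 = [beta1 * m + (1.0 - beta1) * x for m, x in zip(m1, g)]   # step 11
    m2 = [beta2 * m + (1.0 - beta2) * x for m, x in zip(m2, v)]   # step 12
    G = [m / (1.0 - beta1 ** t) for m in m1]         # steps 13-14: bias-corrected gradient
    V = [m / (1.0 - beta2 ** t) for m in m2]         # steps 13, 15: bias-corrected row
    if V[0] == 0.0:
        raise ZeroDivisionError("V[0] is zero, so Z = V / V[0] is undefined")
    Z = [x / V[0] for x in V]                        # step 16
    denom = 1.0 + dot(V, Z) / rho                    # 1 + V^T Z / rho
    if denom == 0.0:
        raise ZeroDivisionError("regularized approximate Hessian is singular")
    vt_g = dot(V, G)
    # Step 17: Sherman-Morrison applied to (Z V^T + rho I)^{-1} G using inner products only.
    direction = [g_j / rho - (z_j * vt_g / rho ** 2) / denom for g_j, z_j in zip(G, Z)]
    w_new = [w_j - eta * d_j for w_j, d_j in zip(w, direction)]   # step 18
    return w_new, m1, m2


# Toy round with two equally weighted clients and a three-parameter model.
w_prev = [0.0, 0.0, 0.0]
w_next, M1, M2 = fagh_server_step(
    w_prev, [0.0] * 3, [0.0] * 3, t=1,
    client_grads=[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    client_rows=[[2.0, 1.0, 0.0], [4.0, 1.0, 2.0]],
    p=[0.5, 0.5], eta=1.0, rho=0.1,
)
```

The update touches each coordinate a constant number of times, so the cost per round is linear in $d$ and no $d\times d$ matrix is formed.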

5 Proposed Method

One communication round of FAGH is shown in Algorithm 1. At communication round $t$, the server first sends the global model $w_{t-1}$ to all the available clients. Each client $C_{i}$ uses $w_{t-1}$ as its initial model, computes the local gradient $g_{i}^{t}$ and the first row of the true Hessian $v_{i}^{t}$ with the help of its local data and local optimizer, and shares $g_{i}^{t}$ and $v_{i}^{t}$ with the server. The server then aggregates all the local gradients to find the global (average) gradient $g^{t}$ across all the clients, aggregates the first rows of all the local Hessians to find the first row of the global (average) Hessian $v^{t}$, and computes their first moments (exponential moving averages) $G^{t}$ and $V^{t}$ respectively. The server uses $V^{t}$ to approximate the global Hessian $H_{a}$ and scales $G^{t}$ directly with the inverse of the regularized $H_{a}$ using the Sherman-Morrison formula for matrix inversion. The server then uses this scaled $G^{t}$ as the global Newton direction.
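The client-side computation can be illustrated with a model whose derivatives are available in closed form. The sketch below assumes binary logistic regression with mean cross-entropy loss, which is a simplification chosen for this example (the experiments in the paper use multinomial logistic regression and CNNs, where these quantities would more naturally come from automatic differentiation). The function name and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def client_gradient_and_first_hessian_row(
    w: Sequence[float],
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> Tuple[List[float], List[float]]:
    """Return (g, v) for mean binary cross-entropy with a logistic model.

    g is the full gradient and v is the first row of the Hessian, i.e. the
    derivative of g[0] with respect to every parameter. The full Hessian is
    never formed: each sample contributes O(d) work to both vectors.
    """
    n, d = len(features), len(w)
    g = [0.0] * d
    v = [0.0] * d
    for x, y in zip(features, labels):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x))
        s = 1.0 / (1.0 + math.exp(-z))          # predicted probability
        for j in range(d):
            g[j] += (s - y) * x[j] / n           # gradient term: (s - y) * x
            v[j] += s * (1.0 - s) * x[0] * x[j] / n  # Hessian row 0: s(1 - s) * x[0] * x
    return g, v


# Toy client data: three samples, two parameters.
toy_features = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
toy_labels = [1, 0, 1]
toy_w = [0.1, -0.2]
g_local, v_local = client_gradient_and_first_hessian_row(toy_w, toy_features, toy_labels)
```

Both returned vectors have length $d$, so the client sends $2d$ numbers to the server per round.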

5.1 Hessian approximation with first row of the true Hessian

Let $w\in R^{d\times 1}=\{x_{1},x_{2},...,x_{d}\}$ be the model parameters to be optimized. We approximate the Hessian with the help of first row of the true Hessian using Statement 1.

Statement 1 The Hessian of a twice differentiable loss function F with respect to $w$ can be approximated by using eq. 5.

$H_{a}=\frac{VU^{T}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}$ (5)

where $V\in R^{d\times 1}=\{\frac{\partial^{2}F}{\partial x_{1}^{2}},\frac{\partial^{2}F}{\partial x_{1}\partial x_{2}},\frac{\partial^{2}F}{\partial x_{1}\partial x_{3}},\ldots,\frac{\partial^{2}F}{\partial x_{1}\partial x_{d}}\}$ is the first row of the true Hessian and $U\in R^{d\times 1}=\{\frac{\partial^{2}F}{\partial x_{1}^{2}},\frac{\partial^{2}F}{\partial x_{2}\partial x_{1}},\frac{\partial^{2}F}{\partial x_{3}\partial x_{1}},\ldots,\frac{\partial^{2}F}{\partial x_{d}\partial x_{1}}\}$ is the first column of the true Hessian.

Proof of Statement 1: Let $H\in R^{d\times d}$ be the true Hessian of the loss function $F$. Then the $(i,j)^{th}$ element of $H$ is $H^{(i,j)}=\frac{\partial^{2}F}{\partial x_{i}\partial x_{j}}$, where $i,j\in\{1,2,3,\ldots,d\}$.

According to eq. 5, the $(i,j)^{th}$ element of the approximated Hessian ($H_{a}$) is $H_{a}^{(i,j)}=\frac{V^{j}\times U^{i}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}$, where $V^{j}=\frac{\partial^{2}F}{\partial x_{1}\partial x_{j}}$ and $U^{i}=\frac{\partial^{2}F}{\partial x_{i}\partial x_{1}}$ are the $j^{th}$ and $i^{th}$ elements of $V$ and $U$ respectively. So, we can write

$H_{a}^{(i,j)}=\frac{V^{j}\times U^{i}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}=\frac{\frac{\partial^{2}F}{\partial x_{1}\partial x_{j}}\times\frac{\partial^{2}F}{\partial x_{i}\partial x_{1}}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}=\frac{\frac{\partial(\frac{\partial F}{\partial x_{1}})}{\partial x_{j}}\times\frac{\partial(\frac{\partial F}{\partial x_{i}})}{\partial x_{1}}}{\frac{\partial(\frac{\partial F}{\partial x_{1}})}{\partial x_{1}}}$

Using the chain rule in Leibniz’s notation Swokowski (1979), we rewrite the above expression under the assumption that the objective function $F$ is a non-linear function. Let $y$ be the scalar-valued output of the model (for a classification task, $y$ is taken to be the output for the true class of the input sample). If $F$ is a non-linear function, then we can express $\frac{\partial F}{\partial x_{i}}$ as a function of $y$ for all $i\in\{1,2,\ldots,d\}$. For example, if $F=\log(y)$, then $\frac{\partial F}{\partial x_{i}}=\frac{1}{y}\frac{\partial y}{\partial x_{i}}$, which is a function of $y$. As another example, if $F=(y-a)^{2}$, where $a$ is any constant, then $\frac{\partial F}{\partial x_{i}}=2(y-a)\frac{\partial y}{\partial x_{i}}$, which is also a function of $y$. We can therefore use the chain rule to reformulate the above expression as follows.

$H_{a}^{(i,j)}=\frac{\frac{\partial(\frac{\partial F}{\partial x_{1}})}{\partial y}\times\frac{\partial y}{\partial x_{j}}\times\frac{\partial(\frac{\partial F}{\partial x_{i}})}{\partial y}\times\frac{\partial y}{\partial x_{1}}}{\frac{\partial(\frac{\partial F}{\partial x_{1}})}{\partial y}\times\frac{\partial y}{\partial x_{1}}}$

$H_{a}^{(i,j)}=\frac{\partial(\frac{\partial F}{\partial x_{i}})}{\partial y}\times\frac{\partial y}{\partial x_{j}}=\frac{\partial^{2}F}{\partial x_{i}\partial x_{j}}$

So, we can write $H_{a}^{(i,j)}=H^{(i,j)}$, which indicates that we can use eq. 5 to approximate the Hessian.

As the Hessian is a symmetric matrix, its first column and first row are identical, so $U=V$. Putting $U=V$ in eq. 5, we get

$H_{a}=\frac{VV^{T}}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}=ZV^{T}$ (6)

So, eq. 6 lets us approximate the full Hessian using only the first row of the true Hessian, where $Z=\frac{V}{\frac{\partial^{2}F}{\partial x_{1}^{2}}}$.
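A small numerical example shows what eq. 6 reconstructs. By construction, $VV^{T}/V[0]$ always reproduces the first row and first column of the Hessian. The sketch below takes the squared loss $F(w)=\frac{1}{2}(a^{T}w-c)^{2}$, whose Hessian $aa^{T}$ is rank one, so that every entry is recovered from the first row. The linear model $y=a^{T}w$, the coefficients and the function name are assumptions made for this illustration; for general models eq. 6 is an approximation.

```python
from typing import List

Matrix = List[List[float]]


def first_row_hessian_approximation(v: List[float]) -> Matrix:
    """Eq. 6: H_a = V V^T / V[0], formed densely here only for inspection."""
    if v[0] == 0.0:
        raise ZeroDivisionError("the first diagonal entry of the Hessian is zero")
    return [[v_i * v_j / v[0] for v_j in v] for v_i in v]


# Toy loss F(w) = 0.5 * (a.w - c)^2 with true Hessian a a^T (independent of w and c).
a = [2.0, -1.0, 0.5]
true_hessian = [[a_i * a_j for a_j in a] for a_i in a]

first_row = true_hessian[0]                      # the only part a FAGH client would send
approx_hessian = first_row_hessian_approximation(first_row)

max_gap = max(abs(approx_hessian[i][j] - true_hessian[i][j])
              for i in range(3) for j in range(3))
```

In FAGH itself the $d\times d$ matrix is never materialized; the server keeps only $V$ and applies the inverse through the Sherman-Morrison formula.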

5.2 Finding global Newton direction

In FAGH, the server uses the global gradient $g^{t}$ and the first row of the global Hessian $v^{t}$ to find the global Newton direction (where $g^{t}$ and $v^{t}$ are found by aggregating the local gradients and the first rows of the local Hessians respectively). Taking inspiration from ADAM Kingma and Ba (2015), we use hyper-parameters $\beta_{1},\beta_{2}\in[0,1)$ to compute the exponential moving averages $M_{1}^{t}$ and $M_{2}^{t}$ of $g^{t}$ and $v^{t}$ respectively. These moving averages let the server update the global model without forgetting knowledge gained in previous communication rounds, which is well suited to partial device participation, where not all clients are available in a given communication round. As $M_{1}^{t}$ and $M_{2}^{t}$ are initialized as vectors of zeros, they are biased towards zero, so we use the bias-corrected estimates $\widehat{M_{1}}^{t}$ and $\widehat{M_{2}}^{t}$ shown in Algorithm 1. We find the approximated global Hessian ($H_{a}$) by putting $V^{t}=\widehat{M_{2}}^{t}$ in eq. 6 and use the following regularized variant of the Newton-type update to avoid forming an indefinite global Hessian Battiti (1992).

$w_{t}=w_{t-1}-\eta{(H_{a}+\rho I)}^{-1}G^{t}$ (7)

where $\rho>0$ is the regularization parameter, $I\in R^{d\times d}$ is the identity matrix and $G^{t}=\widehat{M_{1}}^{t}$. We use the Sherman-Morrison formula of matrix inversion (eq. 4 with $B=\rho I$) to directly compute the global Newton direction without forming or storing the full regularized Hessian $(H_{a}+\rho I)$ or its inverse, as shown below.

$(H_{a}+\rho I)^{-1}G^{t}=(Z^{t}{V^{t}}^{T}+\rho I)^{-1}G^{t}$

$(H_{a}+\rho I)^{-1}G^{t}=\frac{G^{t}}{\rho}-\frac{\frac{Z^{t}{V^{t}}^{T}G^{t}}{\rho^{2}}}{1+\frac{{V^{t}}^{T}Z^{t}}{\rho}}$

where $Z^{t}=\frac{V^{t}}{V^{t}[0]}$ and $V^{t}[0]=\frac{\partial^{2}F}{\partial x_{1}^{2}}$ is the first element of $V^{t}$.
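Substituting $Z^{t}=V^{t}/V^{t}[0]$ into the expression above gives an equivalent form that makes the linear cost explicit. The short derivation below is a worked simplification added for illustration; it uses only the symbols defined in this section and writes $\lVert V^{t}\rVert^{2}$ for ${V^{t}}^{T}V^{t}$.

```latex
\begin{aligned}
{V^{t}}^{T}Z^{t} &= \frac{\lVert V^{t}\rVert^{2}}{V^{t}[0]},
\qquad
Z^{t}{V^{t}}^{T}G^{t} = \frac{V^{t}\,\bigl({V^{t}}^{T}G^{t}\bigr)}{V^{t}[0]} \\[4pt]
(H_{a}+\rho I)^{-1}G^{t}
&= \frac{G^{t}}{\rho}
 - \frac{\dfrac{V^{t}\,\bigl({V^{t}}^{T}G^{t}\bigr)}{\rho^{2}\,V^{t}[0]}}
        {1+\dfrac{\lVert V^{t}\rVert^{2}}{\rho\,V^{t}[0]}}
 = \frac{1}{\rho}\left(G^{t}
 - \frac{{V^{t}}^{T}G^{t}}{\rho\,V^{t}[0]+\lVert V^{t}\rVert^{2}}\;V^{t}\right)
\end{aligned}
```

Only the two inner products ${V^{t}}^{T}G^{t}$ and $\lVert V^{t}\rVert^{2}$ are needed, each costing $O(d)$, and the denominator $\rho V^{t}[0]+\lVert V^{t}\rVert^{2}$ is strictly positive whenever $V^{t}[0]>0$.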

Figure 1: Comparisons of training loss, test loss and test accuracy on CIFAR10 image classification using LeNet5

5.3 Complexities

As FAGH requires each client to compute the gradient and the first row of the true Hessian, the local time and space complexities of FAGH are both O(d + d), which is similar to existing first-order methods like SCAFFOLD Karimireddy et al. (2020), FedDC Gao et al. (2022), etc. Second-order methods like GIANT and DONE have a local time complexity of O(md+d), which is higher than that of FAGH (here $m>1$ is the number of iterations used to approximate the Newton update). The local space complexities of both DONE and GIANT are O(d + d), which is similar to FAGH. FedNL and Basis Matters store the previous step’s Hessian matrix to approximate the current step’s Hessian in local clients, which requires $O(d^{2})$ local space. Compared to this, FAGH is highly beneficial for resource-constrained local clients. In FAGH, the server needs to retain the previous global gradient and the first row of the previous global Hessian to compute the exponential moving averages of the current gradient and the current Hessian’s first row, which results in O(2d + 2d) space complexity on the server. As the server directly computes the global Newton direction with the Sherman-Morrison formula for matrix inversion, the overall time complexity of the server is O(d). The server space complexity of FAGH is only three times more than that of FedAvg, which is far lower than FedSSO’s server space complexity of $O(d^{2})$. Sustaining O(4d) space complexity is unlikely to be a major issue for the host server. One limitation of FAGH is that, like SCAFFOLD, it incurs an O(2d) client-to-server communication cost, which can be handled by applying suitable compression techniques before communicating with the server. The applicability of compression in FL is demonstrated by FedNL and Basis Matters, where the local Hessian is compressed before being sent to the server.
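To put the gap between O(4d) and $O(d^{2})$ storage into concrete terms, the following back-of-the-envelope calculation uses the parameter count of the FashionMNIST CNN reported in Section 6. The 4-byte (float32) storage per entry and the helper name are assumptions of this sketch.

```python
def bytes_for_floats(count: int, bytes_per_float: int = 4) -> int:
    """Storage needed for `count` numbers, assuming float32 by default."""
    return count * bytes_per_float


d = 1_475_338                      # trainable parameters of the FashionMNIST CNN (Section 6)

server_state_entries = 4 * d       # O(4d): the server-side vectors discussed above
dense_hessian_entries = d * d      # O(d^2): a full Hessian, as stored by FedSSO's server

server_state_mb = bytes_for_floats(server_state_entries) / 1e6
dense_hessian_tb = bytes_for_floats(dense_hessian_entries) / 1e12
```

Under these assumptions the four server vectors occupy roughly 24 MB, whereas a dense Hessian for the same model would need roughly 8.7 TB.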

5.4 Applicability of FAGH

As the approximation of the Hessian in section 5.1 is tailored for twice-differentiable and non-linear loss function F, FAGH is applicable only to optimization problems associated with such twice-differentiable and non-linear loss functions. For example, optimizing neural networks or multinomial logistic regression (MLR) with cross-entropy loss function entails dealing with such twice-differentiable and non-linear loss function. To assess the applicability of FAGH, we conducted extensive experiments on federated image classification tasks using machine learning and deep learning models with cross-entropy loss function, from which we observed promising outcomes from FAGH.

6 Experimental setup

To validate our proposed method, we conduct extensive experiments on heterogeneously partitioned CIFAR10 Krizhevsky et al. (2009), FashionMNIST Xiao et al. (2017) and EMNIST-letters Cohen et al. (2017) datasets. CIFAR10 comprises color images ( $3\times 32\times 32$ ) of 10 classes with 60000 samples in total (50000 training samples and 10000 test samples). FashionMNIST comprises grayscale images ( $1\times 28\times 28$ ) of 10 classes with 70000 samples in total (60000 training samples and 10000 test samples). EMNIST-letters comprises grayscale images ( $1\times 28\times 28$ pixels) of handwritten uppercase and lowercase letters, divided into 26 classes. EMNIST-letters includes a total of 145,600 samples, with 124,800 for training and 20,800 for testing. To create heterogeneous data partitions for the CIFAR10 and FashionMNIST datasets, we use the same Dirichlet distribution based heterogeneous and unbalanced partition strategy as in the papers of Yurochkin et al. and Wang et al. We simulate $P_{i}\sim Dir_{K}(0.2)$ and find a heterogeneous partition by allocating a $P_{(i,j)}$ proportion of the samples of the $i^{th}$ class to the $j^{th}$ client. As we use a very small value of the Dirichlet concentration parameter (0.2), each client may not receive samples of all the classes, which indicates a high degree of data heterogeneity across the clients. For our experiments, we use K=200 clients. For the EMNIST dataset, we use a partitioning strategy similar to the one in the paper of McMahan et al., where the data are sorted by class label and then distributed. We create 400 shards of size 312 and assign 2 shards to each of the 200 clients.
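The two partitioning strategies described above can be sketched with the Python standard library as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the gamma-based Dirichlet sampler, the rounding of proportions to sample counts and the toy label lists are all choices made for this example.

```python
import random
from typing import Dict, List, Sequence


def sample_dirichlet(num_clients: int, concentration: float, rng: random.Random) -> List[float]:
    """Draw one symmetric Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely numerical corner case
        return [1.0 / num_clients] * num_clients
    return [x / total for x in draws]


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        concentration: float = 0.2, seed: int = 0) -> List[List[int]]:
    """For each class i, give a proportion P[i][j] of its samples to client j."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = sample_dirichlet(num_clients, concentration, rng)
        start, cumulative = 0, 0.0
        for j, p in enumerate(proportions):
            cumulative += p
            end = len(indices) if j == num_clients - 1 else round(cumulative * len(indices))
            end = min(max(end, start), len(indices))   # keep the cut points monotone
            clients[j].extend(indices[start:end])
            start = end
    return clients


def shard_partition(labels: Sequence[int], num_clients: int = 200,
                    shards_per_client: int = 2, seed: int = 0) -> List[List[int]]:
    """Sort samples by label, cut equal shards, and give each client a fixed number of shards."""
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(labels) // num_shards
    shards = [order[k * shard_size:(k + 1) * shard_size] for k in range(num_shards)]
    rng.shuffle(shards)
    return [sum(shards[c * shards_per_client:(c + 1) * shards_per_client], [])
            for c in range(num_clients)]


# Toy usage with synthetic labels (not the real datasets).
toy_labels_10 = [i % 10 for i in range(1000)]
toy_dirichlet = dirichlet_partition(toy_labels_10, num_clients=20, concentration=0.2, seed=0)
toy_labels_26 = [i % 26 for i in range(1248)]
toy_shards = shard_partition(toy_labels_26, num_clients=4, shards_per_client=2, seed=0)
```

With the paper's settings, `shard_partition` applied to the 124,800 EMNIST-letters training labels with 200 clients and 2 shards per client yields 400 shards of size 312, matching the numbers stated above.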

For CIFAR10 image classification, we use the LeNet5 model LeCun and others (2015). For FashionMNIST, we use a custom convolutional neural network (CNN) model (1475338 trainable parameters in total) with two convolutional layers and three fully connected layers. After each convolutional layer, we use batch normalization, ReLU activation and max-pooling. After the first fully connected layer, we use a dropout of 0.25. For EMNIST-letters, we use a multinomial logistic regression (MLR) model. For all the federated image classification tasks, we use the cross-entropy loss function.

We compare our algorithm with existing state-of-the-art federated learning algorithms such as SCAFFOLD, FedGA, FedExP, GIANT and DONE. To account for partial device participation in FL, $40\%$ of the clients participate in each communication round. We run extensive experiments over a wide set of hyper-parameters for all the methods and select the best performing model for each method by considering minimum training $\&$ test losses and maximum test accuracy. We search FedGA’s $\beta$ , FAGH’s $\rho$ and FedExP’s $\epsilon$ over $\{1,0.5,0.1,0.01,0.001\}$ and the learning rate $\eta$ over $\{1,0.5,0.1,0.01,0.001,0.0001\}$ . We use FAGH $\beta_{1}=0.9$ $\&$ $\beta_{2}=0.99$ , 10 Richardson iterations for DONE, 10 CG iterations for GIANT and $\alpha_{DONE}\in\{0.01,0.05\}$ . For FedExP, we use SGD with momentum (0.9) as the optimizer for local updates. We use a total of T=100 communication rounds. We implement all the methods on a Tesla V100 GPU with PyTorch 1.12.1+cu102. We use seed=0 and batch size = 512. For each dataset, we use the same initialization and the same settings for all the existing and proposed methods.

We compare our algorithm with DONE and GIANT on the MLR-based classification task. We also tried to compare with DONE and GIANT on the CNN-based classification tasks, but we did not find suitable hyperparameters for DONE and GIANT in this CNN-based implementation. This may be due to their assumption of a strongly convex loss function.

Method	$30\%$	$35\%$	$40\%$	$45\%$
FAGH	18	29	43	78
FedGA	23	54	96	…
SCAFFOLD	38	66	…	…
FedExP	24	48	97	…

Table 1: Number of communication rounds required to reach different test accuracies on CIFAR10 with the LeNet5 model. Here, “…” means that the algorithm did not reach the accuracy within T = 100 FL iterations.

Method	$60\%$	$70\%$	$80\%$	$85\%$
FAGH	7	14	36	71
FedGA	7	12	46	…
SCAFFOLD	13	35	…	…
FedExP	8	18	…	…

Table 2: Number of communication rounds required to reach different test accuracies on FashionMNIST with the CNN model. Here, “…” means that the algorithm did not reach the accuracy within T = 100 FL iterations.

Method	$40\%$	$50\%$	$60\%$	$65\%$
FAGH	6	11	25	58
FedGA	11	21	46	…
SCAFFOLD	13	23	61	…
FedExP	11	19	55	…
DONE	7	12	34	96
GIANT	8	12	33	79

Table 3: Number of communication rounds required to reach different test accuracies on EMNIST with the MLR model. Here, “…” means that the algorithm did not reach the accuracy within T = 100 FL iterations.

6.1 Results

Our experimental results are shown in Figs. 1 to 6 and Tables 1 to 3. From the figures, it may be observed that FAGH decreases the training and test losses in less time and fewer communication rounds than SCAFFOLD, FedGA, FedExP, GIANT $\&$ DONE. It may also be observed that FAGH achieves better test accuracy at different time steps and communication rounds than SCAFFOLD, FedGA, FedExP, GIANT $\&$ DONE. From the tables, it may be observed that FAGH generally needs fewer communication rounds to reach the targeted test accuracies. As we use the same initialization and the same settings for all the methods, we may claim that FAGH provides faster FL training when targeting a certain level of global model performance in heterogeneous FL settings with partial client participation. FAGH is easy to implement, as it has only two actively tuned hyper-parameters: the Hessian regularization parameter ( $\rho$ ) and the learning rate ( $\eta$ ). From our experiments, we noticed that, as in ADAM Kingma and Ba (2015), the exponential decay rates for the moment estimates of FAGH ( $\beta_{1}$ and $\beta_{2}$ ) can be fixed to 0.9 and 0.99 respectively.

7 Conclusions

We proposed a new Newton optimization-based FL training method, namely FAGH, which uses the approximated global Hessian to accelerate the convergence of global model training in FL, thereby addressing the heavy communication overhead caused by the large number of communication rounds needed to train the global model toward convergence. FAGH is beneficial for practical implementation in terms of both local and server space complexities in comparison to existing Newton-based FL training algorithms. Experimental results demonstrate that FAGH outperforms several state-of-the-art FL training methods, including SCAFFOLD, FedGA, FedExP, GIANT, and DONE, in terms of the number of communication rounds and the time required to train the global model in FL to achieve the pre-specified performance objectives. In the future, we plan to investigate how to identify a set of local clients for participating in training the global model in an adaptive and privacy-preserving manner, e.g., by leveraging learning vector quantization Qin and Suganthan (2005) and graph matching Gong et al. (2016) techniques, to further improve the convergence of the global model while keeping its performance in other aspects.

References

Agarwal et al. [2017] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res., 18:116:1–116:40, 2017.
Battiti [1992] Roberto Battiti. First and second-order methods for learning: Between steepest descent and newton’s method. Neural Comput., 4(2):141–166, 1992.
Bischoff et al. [2021] Sebastian Bischoff, Stephan Günnemann, Martin Jaggi, and Sebastian U Stich. On second-order optimization methods for federated learning. arXiv preprint arXiv:2109.02388, 2021.
Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017.
Dandi et al. [2022] Yatin Dandi, Luis Barba, and Martin Jaggi. Implicit gradient alignment in distributed and federated learning. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 6454–6462. AAAI Press, 2022.
Derezinski and Mahoney [2019] Michal Derezinski and Michael W. Mahoney. Distributed estimation of the inverse hessian by determinantal averaging. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11401–11411, 2019.
Dinh et al. [2020] Canh T. Dinh, Nguyen Hoang Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
Dinh et al. [2022] Canh T. Dinh, Nguyen H. Tran, Tuan Dung Nguyen, Wei Bao, Amir Rezaei Balef, Bing Bing Zhou, and Albert Y. Zomaya. DONE: distributed approximate newton-type method for federated edge learning. IEEE Trans. Parallel Distributed Syst., 33(11):2648–2660, 2022.
Gao et al. [2022] Liang Gao, Huazhu Fu, Li Li, Yingwen Chen, Ming Xu, and Cheng-Zhong Xu. Feddc: Federated learning with non-iid data via local drift decoupling and correction. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10102–10111. IEEE, 2022.
Gong et al. [2016] Maoguo Gong, Yue Wu, Qing Cai, Wenping Ma, A. K. Qin, Zhenkun Wang, and Licheng Jiao. Discrete particle swarm optimization for high-order graph matching. Information Sciences, 328:158–171, 2016.
Jhunjhunwala et al. [2023] Divyansh Jhunjhunwala, Shiqiang Wang, and Gauri Joshi. Fedexp: Speeding up federated averaging via extrapolation. CoRR, abs/2301.09604, 2023.
Karimireddy et al. [2020] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 2020.
Ketkar and Ketkar [2017] Nikhil Ketkar and Nikhil Ketkar. Stochastic gradient descent. Deep learning with Python: A hands-on introduction, pages 113–132, 2017.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
LeCun and others [2015] Yann LeCun et al. Lenet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20(5):14, 2015.
Li et al. [2020a] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Feddane: A federated newton-type method. CoRR, abs/2001.01920, 2020.
Li et al. [2020b] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. mlsys.org, 2020.
Li et al. [2020c] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
Li et al. [2021] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 10713–10722, 2021.
Liu and Nocedal [1989] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
Ma et al. [2022] Xin Ma, Renyi Bao, Jinpeng Jiang, Yang Liu, Arthur Jiang, Jun Yan, Xin Liu, and Zhisong Pan. Fedsso: A federated server-side second-order optimization algorithm. CoRR, abs/2206.09576, 2022.
Martens and Grosse [2015] James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pages 2408–2417. JMLR.org, 2015.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54, pages 1273–1282. PMLR, 2017.
Nazareth [2009] John L Nazareth. Conjugate gradient method. Wiley Interdisciplinary Reviews: Computational Statistics, 1(3):348–353, 2009.
Qian et al. [2022] Xun Qian, Rustem Islamov, Mher Safaryan, and Peter Richtárik. Basis matters: Better communication-efficient second order methods for federated learning. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 680–720. PMLR, 2022.
Qin and Suganthan [2005] A. K. Qin and P. N. Suganthan. Initialization insensitive LVQ algorithm based on cost-function adaptation. Pattern Recognition, 38(5):773–776, 2005.
Safaryan et al. [2022] Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik. Fednl: Making newton-type methods applicable to federated learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18959–19010. PMLR, 2022.
Shamir et al. [2014] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32, pages 1000–1008, 2014.
Swokowski [1979] Earl William Swokowski. Calculus with analytic geometry. Taylor & Francis, 1979.
Tan et al. [2021] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. CoRR, abs/2103.00710, 2021.
Tankaria et al. [2021] Hardik Tankaria, Dinesh Singh, and Makoto Yamada. Nys-curve: Nyström-approximated curvature for stochastic optimization. CoRR, abs/2110.08577, 2021.
Vuchkov [2022] Radoslav G Vuchkov. Hessian Approximations for Large-Scale Inverse Problems Governed By Partial Differential Equations. PhD thesis, UC Merced, 2022.
Wang et al. [2018] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: globally improved approximate newton method for distributed optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2338–2348, 2018.
Wang et al. [2020a] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris S. Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
Wang et al. [2020b] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yurochkin et al. [2019] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan H. Greenewald, Trong Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 7252–7261. PMLR, 2019.