Initialisation and Topology Effects in Decentralised Federated Learning

Arash Badie-Modiri
Department of Network and Data Science,
Central European University, Austria
Department of Computer Science,
Aalto University, Finland
[email protected]
&Chiara Boldrini
The Institute of Informatics and Telematics,
National Research Council, Italy
[email protected]
&Lorenzo Valerio
The Institute of Informatics and Telematics,
National Research Council, Italy
[email protected]
&János Kertész
Department of Network and Data Science,
Central European University, Austria
[email protected]
&Márton Karsai
Department of Network and Data Science,
Central European University, Austria
[email protected]
Abstract

Fully decentralised federated learning enables collaborative training of individual machine learning models on distributed devices on a communication network while kee** the training data localised. This approach enhances data privacy and eliminates both the single point of failure and the necessity for central coordination. Our research highlights that the effectiveness of decentralised federated learning is significantly influenced by the network topology of connected devices. We propose a strategy for uncoordinated initialisation of the artificial neural networks, which leverages the distribution of eigenvector centralities of the nodes of the underlying communication network, leading to a radically improved training efficiency. Additionally, our study explores the scaling behaviour and choice of environmental parameters under our proposed initialisation strategy. This work paves the way for more efficient and scalable artificial neural network training in a distributed and uncoordinated environment, offering a deeper understanding of the intertwining roles of network structure and learning dynamics.

1 Introduction

The traditional centralised approach to machine learning has shown great progress in the last few decades. This approach, while practical, comes at a cost in terms of systemic data privacy risks and centralisation overhead [1, 2, 3]. To alleviate these issues, the federated learning framework was proposed where each node (client) updates a local machine learning model using local data and only shares its model parameters with a centralised server, which in turn aggregates these individual models into one model and redistributes it to each node [1].

While this approach reduces the data privacy risk by eliminating data sharing with the centralised server, it still maintains a singular point of failure and puts a heavy communication burden on the central server node [4, 5]. Decentralised federated learning aims to provide an alternative approach that maintains data privacy but removes the need for a centralised server. This involves the set of nodes (clients) updating their local models based on the local data, but directly communicating with one another through a communication network. Each node then updates its local model by aggregating those of the neighbourhood [4, 6]. The efficiency of this approach is impacted by several kinds of inhomogeneities [6] characterising the network structure, initial conditions, learning data, and temporal irregularities. In this paper, we focus on the first two of these.

Motivation.

Decentralised federated learning immediately raises two distinct new issues compared to the centralised federated learning approach. First, that the initialisation and operations of the nodes have to be performed in an uncoordinated manner, as the role of coordination previously lay with the server.

Second, that the effect of structure and possible heterogeneities in the communication network structure is poorly understood. In the case of centralised federated learning, the communication network is organised as a simple star graph. In a decentralised setting, however, the network might be an emergent result of, e.g., the social network of the users of the devices, a distributed peer-discovery protocol or a hand-engineered topology comprising of IoT devices. Each of these assumptions leads to a different network topology with wildly different characteristics. Many network topologies modelling real-world phenomena, unlike a star graph, have diameters that monotonically scale up with the number of nodes, inducing an inherent latency in the communication of information between nodes that are not directly connected. Structural heterogeneities, e.g., the dimensionality of the network, degree heterogeneity and heterogeneities in other centrality measures also play important roles in the evolution of the information-sharing processes on networks [7]. This makes network heterogeneities primary candidates for analysis of any decentralised system.

Contribution.

In this manuscript, we focus on two issues: First, we show that a naïve application of commonly used methods of artificial neural network initialisation results in subpar performance when training deep neural networks coupled in a communication network and we propose an alternative uncoordinated initialisation method to resolve this issue. Second, we show that the proposed initialisation causes a significant improvement in different communication network topologies and demonstrate how the topology affects the scaling properties of the learning process.

2 Related works

Decentralised federated learning [4] comes as a natural next step in the development of the field of federated learning to save communication costs, improve robustness and privacy [1]. This approach has been used in application areas such as object recognition in medical images [8, 9] and other industrial settings [10, 11]. It has also been extended by novel optimisation and aggregation methods [12, 5, 13] and theoretical advances in terms of convergence analysis [14].

The structure of complex networks, central to decentralised federated learning by coding the communication structure between connected devices [15], can embody various heterogeneities. These have been found to be a crucial factor in understanding a variety of complex systems that involve many entities communicating or interacting together. For example, the role of degree distribution [16, 17], high clustering [18, 19] or existence of flat or hierarchical community structures [20, 21] in networks of real-world phenomena has been understood and analysed for decades. Recent advances in network modelling have extended heterogeneities in networks from structural to incorporate spatial [22] and also temporal heterogeneities, induced by patterns of, e.g., spatial constraints or bursty or self-exciting activity of the nodes [23, 24, 25, 26]. In the decentralised federated learning settings, the matter of structural heterogeneities of the underlying communication network has only been very recently subjected to systemic studies. Notably, Vogels et al. [27] analyse the effect of topology on optimal learning rate and Palmieri et al. [28] analyse the differences among individual training curves of specific nodes (e.g., high-degree hubs versus peripheries) for Barabási–Albert networks and stochastic block models with two blocks [7].

On the matter of parameter initialisation in federated learning, recent studies have focused on the effect of starting all nodes from a homogeneous set of parameters [6] or the parameters of an independently pre-trained model [29]. Historically, artificial neural networks were initialised from random uniform or Gaussian distribution with scales set based on heuristics and trial and error [30] or for specific activation functions [31]. The advent of much deeper architectures and widespread use of non-linear activation functions such as ReLU or Tanh led to a methodical understanding of the role of initial parameters to avoid exploding or diminishing activations and gradients. Glorot and Bengio [32] proposed a method based on certain simplifying assumptions about the non-linearities used. Later, He et al. [33] defined a more general framework for use with a wider variety of non-linearities, which was used for training the ResNet image recognition architecture [34].

In this work, we will be extending the same approach for effective initialisation of artificial neural network parameters to the decentralised setting, where the parameters are affected not only by the optimisation based on the training data but also due to interactions with other nodes of the communication network.

3 Preliminaries

In our setup, we used a simple decentralised federated learning system with an iterative process. Nodes are connected through a predetermined static, undirected communication network G=(𝒱,)𝐺𝒱G=(\mathcal{V},\mathcal{E})italic_G = ( caligraphic_V , caligraphic_E ) where {{vi,vj}|vi,vj𝒱 and vivj}conditional-setsubscript𝑣𝑖subscript𝑣𝑗subscript𝑣𝑖subscript𝑣𝑗𝒱 and subscript𝑣𝑖subscript𝑣𝑗\mathcal{E}\subseteq\{\{v_{i},v_{j}\}|v_{i},v_{j}\in\mathcal{V}\text{ and }v_{% i}\neq v_{j}\}caligraphic_E ⊆ { { italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } | italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ caligraphic_V and italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }. All nodes train the same artificial neural network architecture h(wi,)subscript𝑤𝑖h(w_{i},\cdot)italic_h ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ⋅ ) with i𝑖iitalic_i indicating the parameters of the i𝑖iitalic_i-th node. Moreover, at time zero, we assume that wiwji,j𝒱formulae-sequencesubscript𝑤𝑖subscript𝑤𝑗for-all𝑖𝑗𝒱w_{i}\neq w_{j}\forall i,j\in\mathcal{V}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∀ italic_i , italic_j ∈ caligraphic_V. n=|𝒱|𝑛𝒱n=|\mathcal{V}|italic_n = | caligraphic_V | indicates the number of nodes (sometimes referred to as the system size), while ki=|𝒩i|subscript𝑘𝑖subscript𝒩𝑖k_{i}=|\mathcal{N}_{i}|italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = | caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | indicates the degree of a node i𝑖iitalic_i defined as the number of its neighbours, and p(k)𝑝𝑘p(k)italic_p ( italic_k ) is the degree distribution. For a node that is reached by following a random link, q(k)𝑞𝑘q(k)italic_q ( italic_k ) is the distribution of the number of other links to that node, known as the excess degree distribution. The adjacency matrix is indicated by A𝐴Aitalic_A.

Nodes vi𝒱subscript𝑣𝑖𝒱v_{i}\in\mathcal{V}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V initialise their local model parameters based on one of the following strategies: (1) random initialisation with no gain correction, an uncoordinated approach where each node draws their initial parameters independently based on a strategy optimised for isolated centralised training, e.g., from Ref. [33]; or (2) random initialisation with gain correction, which is our proposed initialisation strategy that re-scales initial parameter distributions from (1) based on the topology of the communication network. This approach will be explored in detail in Section 4.

Each node i𝑖iitalic_i of the communication network holds a labelled local dataset Disubscript𝐷𝑖D_{i}italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT such that DiDj=,i,j𝒱formulae-sequencesubscript𝐷𝑖subscript𝐷𝑗for-all𝑖𝑗𝒱D_{i}\cap D_{j}=\emptyset,\forall i,j\in\mathcal{V}italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∩ italic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ∅ , ∀ italic_i , italic_j ∈ caligraphic_V which are a portion of the same global dataset D=i=1|𝒱|Di𝐷superscriptsubscript𝑖1𝒱subscript𝐷𝑖D=\bigcup_{i=1}^{|\mathcal{V}|}D_{i}italic_D = ⋃ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | caligraphic_V | end_POSTSUPERSCRIPT italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Each node can access its private data but not those of other nodes and we assume that the local datasets do not change over time. All nodes train the same artificial neural network architecture h(wi,)subscript𝑤𝑖h(w_{i},\cdot)italic_h ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , ⋅ ) with wipsubscript𝑤𝑖superscript𝑝w_{i}\in\mathbb{R}^{p}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT indicating the parameters of the i𝑖iitalic_i-th node solving the following Empirical Risk Minimisation problem:

minwF(wi;Di),subscript𝑤𝐹subscript𝑤𝑖subscript𝐷𝑖\displaystyle\min_{w}F(w_{i};D_{i}),roman_min start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT italic_F ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , F(wi;Di)=1|Di|(x,y)Di(y,h(wi;x))𝐹subscript𝑤𝑖subscript𝐷𝑖1subscript𝐷𝑖subscript𝑥𝑦subscript𝐷𝑖𝑦subscript𝑤𝑖𝑥\displaystyle F(w_{i};D_{i})=\frac{1}{|D_{i}|}\sum_{(x,y)\in D_{i}}\ell(y,h(w_% {i};x))italic_F ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG | italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT ( italic_x , italic_y ) ∈ italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_ℓ ( italic_y , italic_h ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_x ) ) (1)

with \ellroman_ℓ representing the loss function, which can be convex or non-convex. Moreover, at time zero we assume that wiwji,j𝒱formulae-sequencesubscript𝑤𝑖subscript𝑤𝑗for-all𝑖𝑗𝒱w_{i}\neq w_{j}\forall i,j\in\mathcal{V}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∀ italic_i , italic_j ∈ caligraphic_V The nodes perform one or more batches of local training on their local data using an optimiser. At each iteration t𝑡titalic_t, called here a communication round, the i𝑖iitalic_i-th node updates its parameters (weights and biases) using parameter values communicated by the nodes in its neighbourhood through the decentralised version of FedAvg (here called DecAvg):

wi=βiwi+j𝒩iβjwj,βi=|Di||Di|+j𝒩i|Dj|,i𝒱.formulae-sequencesubscript𝑤𝑖subscript𝛽𝑖subscript𝑤𝑖subscriptfor-all𝑗subscript𝒩𝑖subscript𝛽𝑗subscript𝑤𝑗formulae-sequencesubscript𝛽𝑖subscript𝐷𝑖subscript𝐷𝑖subscriptfor-all𝑗subscript𝒩𝑖subscript𝐷𝑗for-all𝑖𝒱w_{i}=\beta_{i}w_{i}+\sum_{\forall j\in\mathcal{N}_{i}}\beta_{j}w_{j},\quad% \beta_{i}=\frac{|D_{i}|}{|D_{i}|+\sum_{\forall j\in\mathcal{N}_{i}}|D_{j}|},% \forall i\in\mathcal{V}\,.italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT ∀ italic_j ∈ caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG | italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG start_ARG | italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | + ∑ start_POSTSUBSCRIPT ∀ italic_j ∈ caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT | italic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | end_ARG , ∀ italic_i ∈ caligraphic_V . (2)

Since both the iid and non-iid distributions used in the manuscript provide on expectation the same number of total items per node, we can assume βi1/(ki+1)subscript𝛽𝑖1subscript𝑘𝑖1\beta_{i}\approx 1/(k_{i}+1)italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≈ 1 / ( italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 ). In this paper, we use a combination of different datasets, iid and non-iid distribution of labelled data, neural network architectures (i.e., MLP, CNN) and optimisers (i.e., SGD, Adam). For more information, refer to Appendix A.

Expected values are indicated using the brackets around the random variable, e.g., kdelimited-⟨⟩𝑘\langle k\rangle⟨ italic_k ⟩ is the mean degree. Standard deviation is shown using σ()𝜎\sigma(\ldots)italic_σ ( … ). The d𝑑ditalic_d parameters of node i𝑖iitalic_i are indicated by vector wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of size d𝑑ditalic_d, sometimes arranged in a d×n𝑑𝑛d\times nitalic_d × italic_n matrix W={wj,i}𝑊subscript𝑤𝑗𝑖W=\{w_{j,i}\}italic_W = { italic_w start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT }. σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT indicates the mean standard deviation across columns of W𝑊Witalic_W, i.e. i=1nσ(w,i)/nsuperscriptsubscript𝑖1𝑛𝜎subscript𝑤𝑖𝑛\sum_{i=1}^{n}\sigma(w_{*,i})/n∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_σ ( italic_w start_POSTSUBSCRIPT ∗ , italic_i end_POSTSUBSCRIPT ) / italic_n where w,isubscript𝑤𝑖w_{*,i}italic_w start_POSTSUBSCRIPT ∗ , italic_i end_POSTSUBSCRIPT denotes column i𝑖iitalic_i of W𝑊Witalic_W, while σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT indicates the mean standard deviation across rows, i.e. j=1dσ(wj,)/dsuperscriptsubscript𝑗1𝑑𝜎subscript𝑤𝑗𝑑\sum_{j=1}^{d}\sigma(w_{j,*})/d∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_σ ( italic_w start_POSTSUBSCRIPT italic_j , ∗ end_POSTSUBSCRIPT ) / italic_d, where wj,subscript𝑤𝑗w_{j,*}italic_w start_POSTSUBSCRIPT italic_j , ∗ end_POSTSUBSCRIPT denotes row j𝑗jitalic_j of W𝑊Witalic_W. Artificial neural networks are usually initialised with parameters that are drawn from different sets of distributions, e.g., weights of each layer are drawn from a separate distribution. In this case, a vector wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT can be formed for one specific set of parameters, drawn from a zero-mean distribution with standard deviation σinit.subscript𝜎𝑖𝑛𝑖𝑡\sigma_{init.}italic_σ start_POSTSUBSCRIPT italic_i italic_n italic_i italic_t . end_POSTSUBSCRIPT.

4 Uncoordinated initialisation of artificial neural networks

Unlike in centralised federated learning, it is unwarranted to assume that in a massive decentralised federated learning setup, all nodes can negotiate and agree on initial values for model parameters in a manner that each parameter is exactly equal across all nodes. A fully uncoordinated method for selecting initial values for the model parameters means that each node should be able to draw initial values for their model parameters independently with few or no communication in a manner that is robust to possible communication errors.

It has been shown that judicious choice of initialisation strategy can enable training of much deeper artificial neural networks [32, 33, 34]. Specifically, a good parameter initialisation method leads to initial parameters that neither increase nor decrease activation values for consecutive layers exponentially [33, 32]. In the case of decentralised federated learning, this proves more challenging, as the aggregation step changes the distribution of parameters, meaning that the optimal initial value distributions are not only a function of the machine learning model architecture, but also affected by the communication network structure.

Empirically, we observe (Figure 1 dashed lines) that the decentralised, uncoordinated initialisation of nodes using the method proposed by He et al. [33] results in progressively poorer performance in the federated setting as the number of nodes grows, while the proposed uncoordinated initialisation performs similarly to the coordinated homogeneous strategy. Figure 1(b,d,f,h) shows this as scaling of the loss trajectory as a function of communication rounds with the number of nodes as nμsuperscript𝑛𝜇n^{\mu}italic_n start_POSTSUPERSCRIPT italic_μ end_POSTSUPERSCRIPT, with values of μ𝜇\muitalic_μ observed in range 0.4μ10.4𝜇10.4\leq\mu\leq 10.4 ≤ italic_μ ≤ 1 depending on the experimental setting.

Refer to caption
Figure 1: Mean test cross-entropy loss with the proposed initialisation (solid lines) compared to the initialisation method proposed in He et al. [33] without re-scaling (dashed lines). The decentralised federated learning process on nodes connected through (a,b) fully-connected (complete) networks with MNIST classification task on a simple multilayer perceptron with iid data distribution (c,d) Barabási–Albert networks with average degree 4, with the So2Sat LCZ42 classification task, using a simple convolutional architecture, Zipf data distribution, α=1.8𝛼1.8\alpha=1.8italic_α = 1.8, (e,f) random 4-regular networks with CIFAR-10 classification task with VGG16 architecture and (g,h) same configuration as (a,b) but using Adam optimiser with decoupled weight decay. The results show that without the proposed re-scaling of the parameters, the mean test loss has a plateau lasting a number of rounds linear or sub-linear to the system size, while our uncoordinated proposed initialisation method performs on par with the coordinated homogeneous initialisation. Bottom row (b,d,f,h) shows the empirical scaling of the test loss time trajectory of the independent [33] method initialisation with system size, with exponents ranging from 0.4 to 1. Error bars represent 95% confidence intervals.

In more realistic settings, it is often the case that, either due to faults, technical limitations or deliberate choice, not all communication channels between nodes stay open at all times, or that not all nodes participate at every round of communication. We analyse this by assuming each connection or node is active at each point in time with a probability p𝑝pitalic_p. In Figure 2 we observe that our initialisation method performs favourably compared to the initialisation method of He et al. [33] even at very low values of occupation probability p𝑝pitalic_p, with each node arriving at an eventually consistent state much earlier.

Refer to caption
Figure 2: Mean cross-entropy test loss as a function of communication rounds for a fully connected communication network n=64𝑛64n=64italic_n = 64 using the MNIST dataset with 512 items per node. Each (a) connection or (b) node are active at each round with probability p𝑝pitalic_p. Note that inactive nodes still perform local training, but are in effect momentarily isolated from the network. The proposed initialisation is displayed with solid lines and the independent initialisation method of He et al. [33] with dashed lines. Even at fairly low values of p𝑝pitalic_p, the system as a whole has a much better overall learning trajectory with our proposed parameter initialisation method compared to that of He et al. [33]. Error bars represent 95% confidence intervals.

To understand the general characteristics of the learning process we propose a simplified numerical model: an iterative process, where each of the n𝑛nitalic_n network nodes has a vector of d𝑑ditalic_d parameters drawn from a Gaussian distribution with standard deviation σinit.subscript𝜎init.\sigma_{\text{init.}}italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT and mean zero. At each iteration, similar to the federated averaging step, each node updates its parameter vector by averaging its immediate neighbourhood, then, to mimic the effects of the local training step all node parameters are updated by adding a Gaussian distributed noise with zero mean and standard deviation σnoisesubscript𝜎noise\sigma_{\text{noise}}italic_σ start_POSTSUBSCRIPT noise end_POSTSUBSCRIPT.

This model mimics the general behaviour of the decentralised learning system at the early stages of the process, since we can assume that the changes to the parameters as a result of the local learning process are generally negligible compared to the changes in parameters due to the aggregation steps. Simulations of the decentralised federated learning process (Figure 3(a,b)) can provide evidence for this assumption.

Refer to caption
Figure 3: (a) Mean magnitude of change in parameters due to training and aggregation independently as well as the total change, as well as the mean cosine similarity of the changes during training and aggregation. In the early rounds of the iterative process, the vector of change due to the aggregation is several orders of magnitude larger than that of the training. Additionally, the cosine similarity trajectory indicates the orthogonality of these vectors in the early rounds, supporting the numeric model assumption that the early evolution of the system is dominated by the aggregation step. Additionally evolution of standard deviation of σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT and σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT on (b) the distributed learning process with actual ANNs and (c) the numerical simplified model show similar early-stage dynamics. Values were calculated by (a,b) running or (c) numerically modelling the decentralised federated learning process on random 32-regular n=256𝑛256n=256italic_n = 256 networks. Panels (a,b) were performed with 80 training samples per node, 1 epoch per communication round. Error bars represent 95% confidence intervals.

The results from the simplified numeric model for random k𝑘kitalic_k-regular graphs111Random k𝑘kitalic_k-regular graphs are random graphs where each node has degree k𝑘kitalic_k. predict, as shown in Figure 3(c), that the standard deviation of the value of the same parameter across nodes averaged for all parameters, which we call σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT, will decrease to some value close to the standard deviation of noise (simulating changes due to local training). Meanwhile the standard deviation of the parameters of the same node average across all nodes, σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT, will decrease only to a factor of 1/n1𝑛1/\sqrt{n}1 / square-root start_ARG italic_n end_ARG of the original standard deviation σinit.subscript𝜎init.\sigma_{\text{init.}}italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT. Note that for artificial neural networks each layer’s initial parameters are usually drawn from distributions with different values of σinit.subscript𝜎init.\sigma_{\text{init.}}italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT, based on the number of inputs and outputs of each layer and other considerations. The analysis here can be applied to each batch of parameters drawn from the same distribution, e.g., to weights of the same layer, independently.

Two of the dynamics visualised in Figure 3(b,c) stand out in particular. First, the value towards which σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT approaches can allow us to select the initial distribution of parameters σinit.subscript𝜎init.\sigma_{\text{init.}}italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT in a way that after stabilisation of σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT, the neural network models would on expectation have an optimal parameter distribution. In Section 4.1, we show that this compression can be calculated for any graph based on the distribution of eigenvector centralities of the nodes [7], with the case of graphs with uniform centralities giving a factor of 1/n1𝑛1/\sqrt{n}1 / square-root start_ARG italic_n end_ARG. Second, the time to reach the steady state for σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT plays an important role since this determines the number of rounds required before the improvements of the learning process start in earnest. This is because the magnitude of the changes to parameters due to the learning process (modelled by noise in the numerical model) becomes comparable to those of the aggregation process. In Section 4.3 we show that this “stabilisation” time scales similar, up to a constant factor, to the mixing time of lazy random walks on the graph.

Algorithm 1 Decentralised federated training cycle along with theproposed initialisation steps.
1:given number local batches b𝑏bitalic_b, Optimiser Opt()𝑂𝑝𝑡Opt()italic_O italic_p italic_t ( ), set of neighbours 𝒩𝒩\mathcal{N}caligraphic_N.
2:estimate vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ based on a sketch of the communication network, vide Section 4.2.
3:for layer l𝑙litalic_l with parameters ω0lsubscriptsuperscript𝜔𝑙0\omega^{l}_{0}italic_ω start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT in w0subscript𝑤0w_{0}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and activation function flsubscript𝑓𝑙f_{l}italic_f start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT do
4:     initialise ω0lsubscriptsuperscript𝜔𝑙0\omega^{l}_{0}italic_ω start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT based on an initialisation method suitable for the architecture, e.g. He et al. [33]
5:     scale ω0lsubscriptsuperscript𝜔𝑙0\omega^{l}_{0}italic_ω start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT by a factor of vsteady1superscriptnormsubscript𝑣steady1\left\|v_{\text{steady}}\right\|^{-1}∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT
6:end for
7:repeat
8:     tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
9:     gt1BatchGrads(wt1)subscript𝑔𝑡1BatchGradssubscript𝑤𝑡1g_{t-1}\leftarrow\text{BatchGrads}(w_{t-1})italic_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ← BatchGrads ( italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT )
10:     wtOpt(gt1,wt1)subscript𝑤𝑡𝑂𝑝𝑡subscript𝑔𝑡1subscript𝑤𝑡1w_{t}\leftarrow Opt(g_{t-1},w_{t-1})italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_O italic_p italic_t ( italic_g start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT )
11:     if tmodb=0modulo𝑡𝑏0t\mod b=0italic_t roman_mod italic_b = 0 then
12:         send parameters wtsubscript𝑤𝑡w_{t}italic_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to neighbours
13:         receive parameters of all neighbours as wtii𝒩subscriptsuperscript𝑤𝑖𝑡for-all𝑖𝒩w^{i}_{t}\,\forall i\in\mathcal{N}italic_w start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∀ italic_i ∈ caligraphic_N
14:         aggregate neighbourhood parameters as in Equation 2\triangleright DecAvg aggregation
15:         re-initialise optimiser state.
16:     end if
17:until stop** criteria are met

4.1 The compression of node parameters

For the simplified model we can analytically estimate the steady state values for σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT and σapsubscript𝜎𝑎𝑝\sigma_{ap}italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT, as well as the scaling of the number of rounds to arrive at these values using methods from finite-state discrete-time Markov chains. Let A𝐴Aitalic_A be the adjacency matrix of our underlying graph G𝐺Gitalic_G. We construct a right stochastic matrix Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT where

Aij=Aij+IijkAkj+Ikj,subscriptsuperscript𝐴𝑖𝑗subscript𝐴𝑖𝑗subscript𝐼𝑖𝑗subscript𝑘subscript𝐴𝑘𝑗subscript𝐼𝑘𝑗\displaystyle A^{\prime}_{ij}=\frac{A_{ij}+I_{ij}}{\sum_{k}A_{kj}+I_{kj}}\,,italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = divide start_ARG italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_I start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_k italic_j end_POSTSUBSCRIPT + italic_I start_POSTSUBSCRIPT italic_k italic_j end_POSTSUBSCRIPT end_ARG , (3)

where I𝐼Iitalic_I is the identity matrix. This corresponds to the Markov transition matrix of random walks on graph G𝐺Gitalic_G, if the random walker can stay at the same node or take one of the links connected to that node with equal probability for each possible action. This formulation can also be seen on the basis of the DecAvg aggregation in Equation 2. If we arrange all initial node parameters in a d×n𝑑𝑛d\times nitalic_d × italic_n matrix Winit.subscript𝑊init.W_{\text{init.}}italic_W start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT, the parameters at round t𝑡titalic_t are determined by Winit.At+τ=0t1NτAτsubscript𝑊init.superscript𝐴𝑡subscriptsuperscript𝑡1𝜏0subscript𝑁𝜏superscript𝐴𝜏W_{\text{init.}}A^{\prime t}+\sum^{t-1}_{\tau=0}N_{\tau}A^{\prime\tau}italic_W start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT ′ italic_t end_POSTSUPERSCRIPT + ∑ start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_τ = 0 end_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT ′ italic_τ end_POSTSUPERSCRIPT, where Nτsubscript𝑁𝜏N_{\tau}italic_N start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT is a random d×n𝑑𝑛d\times nitalic_d × italic_n noise matrix with each index drawn from 𝒩(0,σnoise2)𝒩0superscriptsubscript𝜎noise2\mathcal{N}(0,\sigma_{\text{noise}}^{2})caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT noise end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ).

Assuming that the graph G𝐺Gitalic_G is connected, the matrix Atsuperscript𝐴𝑡A^{\prime t}italic_A start_POSTSUPERSCRIPT ′ italic_t end_POSTSUPERSCRIPT would converge to a matrix where each row is the steady state vector of the Markov matrix Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, the eigenvector corresponding to the largest eigenvalue 1, normalised to sum to 1. If the steady state vector is given as vsteadysubscript𝑣steadyv_{\text{steady}}italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT the variance contribution of the term Winit.Atsubscript𝑊init.superscript𝐴𝑡W_{\text{init.}}A^{\prime t}italic_W start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT ′ italic_t end_POSTSUPERSCRIPT along each row (i.e., expected variance of parameters of each node) is given as σinit.2vsteady2superscriptsubscript𝜎init.2superscriptnormsubscript𝑣steady2\sigma_{\text{init.}}^{2}\left\|v_{\text{steady}}\right\|^{2}italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. It can be trivially shown, from a direct application of the Cauchy–Schwarz inequality |1,vsteady|1vsteady1subscript𝑣steadynorm1normsubscript𝑣steady|\langle\vec{1},v_{\text{steady}}\rangle|\leq\|\vec{1}\|\|v_{\text{steady}}\|\,| ⟨ over→ start_ARG 1 end_ARG , italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ⟩ | ≤ ∥ over→ start_ARG 1 end_ARG ∥ ∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ that for connected networks 1,vsteady=vsteady=11subscript𝑣steadysubscript𝑣steady1\langle\vec{1},v_{\text{steady}}\rangle=\sum v_{\text{steady}}=1⟨ over→ start_ARG 1 end_ARG , italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ⟩ = ∑ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT = 1, the vsteady2superscriptnormsubscript𝑣steady2\left\|v_{\text{steady}}\right\|^{2}∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT term has a minimum value of 1/n1𝑛1/n1 / italic_n. This is achieved for random regular networks and other network models where nodes have uniformly distributed eigenvector centralities222Eigenvector centrality is a node characteristic  [7]) calculated from the eigenvector equation Ax=λmaxx𝐴𝑥subscript𝜆max𝑥Ax=\lambda_{\text{max}}xitalic_A italic_x = italic_λ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT italic_x, where λmaxsubscript𝜆max\lambda_{\text{max}}italic_λ start_POSTSUBSCRIPT max end_POSTSUBSCRIPT is the greatest eigenvalue of A𝐴Aitalic_A. The eigenvector centrality of node i𝑖iitalic_i is xi/jxjsubscript𝑥𝑖subscript𝑗subscript𝑥𝑗x_{i}/\sum_{j}x_{j}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Here the eigenvector centrality as calculated from Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is meant., such as Erdős–Rényi networks and lattices on d𝑑ditalic_d-dimensional tori.

Given that the noises are drawn independently, the variance contribution of the noise term τ=0t1NτAτsuperscriptsubscript𝜏0𝑡1subscript𝑁𝜏superscript𝐴𝜏\sum_{\tau=0}^{t-1}N_{\tau}A^{\prime\tau}∑ start_POSTSUBSCRIPT italic_τ = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT ′ italic_τ end_POSTSUPERSCRIPT has an upper-bound of tσnoise2𝑡subscriptsuperscript𝜎2noiset\sigma^{2}_{\text{noise}}italic_t italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT noise end_POSTSUBSCRIPT. If tσnoise2σinit2/nmuch-less-than𝑡subscriptsuperscript𝜎2noisesubscriptsuperscript𝜎2init𝑛t\sigma^{2}_{\text{noise}}\ll\sigma^{2}_{\text{init}}/nitalic_t italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT noise end_POSTSUBSCRIPT ≪ italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT init end_POSTSUBSCRIPT / italic_n, then the standard deviation across parameters at round t𝑡titalic_t can be approximated by limtσapσinit.vsteadysubscript𝑡subscript𝜎𝑎𝑝subscript𝜎init.normsubscript𝑣steady\lim_{t\rightarrow\infty}\sigma_{ap}\approx\sigma_{\text{init.}}\left\|v_{% \text{steady}}\right\|roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT ≈ italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT ∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥. For a large connected random k𝑘kitalic_k-regular network, this reduces to limtσap=σinit./nsubscript𝑡subscript𝜎𝑎𝑝subscript𝜎init.𝑛\lim_{t\rightarrow\infty}\sigma_{ap}=\sigma_{\text{init.}}/\sqrt{n}roman_lim start_POSTSUBSCRIPT italic_t → ∞ end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_a italic_p end_POSTSUBSCRIPT = italic_σ start_POSTSUBSCRIPT init. end_POSTSUBSCRIPT / square-root start_ARG italic_n end_ARG. Generally for other networks, numerical solution for vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ can be obtained by calculating sum-normalised eigenvector centralities of the original network after adding self-loops to all nodes, with weights equal to the node degree.

These results, combined with the existing analyses on the role of artificial neural network parameters and their effect on diminishing or exploding gradients [33] suggest that it is reasonable to take into account the compression of the node parameters (e.g., the 1/n1𝑛1/\sqrt{n}1 / square-root start_ARG italic_n end_ARG factor for k𝑘kitalic_k-regular networks) when initialising the parameters, and we show in Section 4.2 that even a rough estimate of this factor would be quite effective in practice. Depending on the choice of architecture and optimiser, and especially for large networks with hundreds of nodes, this re-scaling of initial parameters can play a sizeable role in the efficacy of the training process. In our experiments, we took this into account by multiplying a gain factor of n𝑛\sqrt{n}square-root start_ARG italic_n end_ARG in the standard deviation of layer parameters suggested by He et al. [33].

Note that vsteadysubscript𝑣steadyv_{\text{steady}}italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT is simply the sum-normalised vector of eigenvector centralities of the communication network nodes with a self-loop added to all nodes with a weight equal to the degree of that node. Each element of that vector specifies the probability of a random walk to end up on that specific node, if the random walk process has equal probability of taking any one of the edges or staying on the node. This means that the value of this gain is a factor of the system size and the distribution of network centralities. The practical application of this is explored in Section 4.2.

The numeric model applies with minimal changes to directed and weighted communication networks, similar to connected undirected networks. In the case of a strongly connected directed network, the convergence is guaranteed since the stochastic matrix Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is aperiodic due to the existence of the self-loops. For the case of a weighted communication network, the weights are reflected in the graph adjacency matrix A𝐴Aitalic_A, with the provision that a diagonal matrix of the weights each node assigns to its own weights should be used in Equation 3 instead of the identity matrix I𝐼Iitalic_I.

4.2 Estimating parameter scaling factor vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥

Calculating the scaling factor vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ based on a perfect knowledge of the entire communication network is trivial. In real-world scenarios, however, it is often the case that we can only rely on each node’s imperfect knowledge of the connectivity network during initialisation. In this section, we explore a few scenarios to illustrate how this affects the process described.

Often, while the full topology of the communication network might not be known to each node, the network might have emerged as a result of a central organising principle. For example, assume that the communication network is formed through a peer discovery system where a new node is assigned to be connected to existing nodes. If the assignment probability is a linear function of their current degree (i.e., a node with more neighbours is more likely to be recommended to a new node) then the resulting network would be an example of the preferential attachment process, with a power-law degree sequence [35]. Similarly, if the communication networks are based off of real-world social relationships or physical distance of the nodes, then the network would have properties similar to those observed in social or geometric networks, respectively. If such information about the network formation principles are known beforehand, fewer variables need to be estimated.

Let’s take, for example, a communication network formed based on randomly establishing connections between two nodes, or formed one based on a Barabási–Albert preferential attachment process. In these cases, with only an estimate of the number of nodes n𝑛nitalic_n, it is possible to estimate the scaling factor vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥. While this information is not necessarily available to all nodes, a simple application of a gossip protocol can provide an estimate to all nodes. It is important to note that the estimate for n𝑛nitalic_n need not be exact. Figure 4(a) shows that even a significant over- or underestimation of the number of nodes would still advantage our proposed initialisation method.

Refer to caption
Figure 4: The effect of incomplete information in form of over- or under-estimating (a) the number of nodes or (b) the vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ size exponent (vide Figure 5(a,b)) on our proposed initialisation method. Note that our proposed initialisation still performs significantly better than unscaled independent initialisation, shown with dashed lines. Performed on the MLP configuration with MNIST dataset, on fully connected network. Error bars represent 95% confidence intervals

If no information on the network topology is known in advance, it is possible to arrive at a best guess by polling a sample of the network (perhaps through a gossip protocol) for a degree distribution. The scaling of vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ with system size in random networks with different distributions are illustrated in Figure 5 where the value of the scaling factor exponent is derived for Erdős–Rényi, k𝑘kitalic_k-regular, Barabási–Albert and heavy-tail degree distribution configuration model random networks with the same size and (on expectation) the same number of links. The value of vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ is simply nαsuperscript𝑛𝛼n^{-\alpha}italic_n start_POSTSUPERSCRIPT - italic_α end_POSTSUPERSCRIPT where α𝛼\alphaitalic_α is the scaling factor exponent in Figure 5(a,b).

Furthermore, we show empirically that vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ is independent of the degree assortativity of the network. This is done by rewiring a network using the edge swap method, i.e. selecting two edges and swap** the endpoints. This is performed through simulated annealing: a specific target value for degree-assortativity is set and random edge-swaps are accepted or rejected based on their utility and a temperature variable that decreases slowly over time, until the network converges to the desired target assortativity. The results in Figure 5(c) show that vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ stays the same after rewiring.

Refer to caption
Figure 5: (a,b) The effect of heterogeneous distribution of centralities in the scaling factor vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ from the simplified numerical model. Homogeneous random networks (Erdős–Rényi G(n,p)𝐺𝑛𝑝G(n,p)italic_G ( italic_n , italic_p ) networks and random k𝑘kitalic_k-regular networks) display vsteady1/nnormsubscript𝑣steady1𝑛\left\|v_{\text{steady}}\right\|\approx 1/\sqrt{n}∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ ≈ 1 / square-root start_ARG italic_n end_ARG, while Barabási–Albert networks and configuration model heavy-tail degree distribution networks (with degree distribution p(k)kγsimilar-to𝑝𝑘superscript𝑘𝛾p(k)\sim k^{-\gamma}italic_p ( italic_k ) ∼ italic_k start_POSTSUPERSCRIPT - italic_γ end_POSTSUPERSCRIPT) show this factor scaling exponentially with the number of nodes with different exponents that is itself a function of γ𝛾\gammaitalic_γ. (c) Degree distribution preserving rewiring of Erdős–Rényi network (n=2048𝑛2048n=2048italic_n = 2048) to produce networks with various values of degree assortativity ρ𝜌\rhoitalic_ρ shows that vsteadynormsubscript𝑣steady\left\|v_{\text{steady}}\right\|∥ italic_v start_POSTSUBSCRIPT steady end_POSTSUBSCRIPT ∥ is not affected by degree assortativity. Error bars represent 95% confidence intervals.

4.3 Initial stabilisation time

The stabilisation time of σansubscript𝜎𝑎𝑛\sigma_{an}italic_σ start_POSTSUBSCRIPT italic_a italic_n end_POSTSUBSCRIPT, the number of communication rounds until the blue curve in Figure 3(b,c) flattens out, determines the number of rounds where local training has a negligible effect on the parameters. Understanding the scaling of stabilisation time with the number of nodes and other environmental parameters is important, as before this stabilisation the aggregation process dominates the local training process by several orders of magnitude (Figure 3(a)), inhibiting effective training.

Deriving the scaling of number of stabilisation rounds with number of nodes n𝑛nitalic_n is a matter of calculating the mixing time of the Markov matrix Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. The problem is remarkably close to the lazy random walk, where at each step the walker might stay with probability 1/2121/21 / 2 or select one of the links for the next transition. However, in our case this staying probability is lower or equal to that of lazy random walk, being equal to 1/(ki+1)1subscript𝑘𝑖11/(k_{i}+1)1 / ( italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + 1 ) where kisubscript𝑘𝑖k_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the degree of node i𝑖iitalic_i. It has been shown that, since the staying probabilities are bounded in (0,1)01(0,1)( 0 , 1 ), the mixing time of the random walk process described here grows asymptotically with that of lazy random walk up to a constant factor [36, Corollary 9.5].

Mixing time of lazy random-walks on graphs is a subject of active study. Lattices on d𝑑ditalic_d-dimensional tori have a mixing time with an upper bound at d2l2superscript𝑑2superscript𝑙2d^{2}l^{2}italic_d start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_l start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT [37, Theorem 5.5] where ln1/dproportional-to𝑙superscript𝑛1𝑑l\propto n^{1/d}italic_l ∝ italic_n start_POSTSUPERSCRIPT 1 / italic_d end_POSTSUPERSCRIPT is the linear system size. Connected random k𝑘kitalic_k-regular networks, as expander graphs, have a mixing time of O(logn)𝑂𝑛O(\log n)italic_O ( roman_log italic_n ) [38, 39], while connected supercritical Erdős–Rényi (G(n,m)𝐺𝑛𝑚G(n,m)italic_G ( italic_n , italic_m ) and G(n,p)𝐺𝑛𝑝G(n,p)italic_G ( italic_n , italic_p )) graphs (with average degree larger than 1) have lazy random walk mixing times of O(log2n)𝑂superscript2𝑛O(\log^{2}n)italic_O ( roman_log start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_n ) [40, 41]. Generally, for a given Markov matrix the convergent rate can be estimated based on its spectral gap [42].

5 Conclusion

Here we introduced a fully uncoordinated, decentralised artificial neural network initialisation method that provides a significantly improved training trajectory, while solely relying on the macroscopic properties of the communication network. We also showed that the initial stages of the uncoordinated decentralised federated learning process are governed by dynamics similar to those of the lazy random walk on graphs. Furthermore, we also showed empirically (Appendix B) that when using the proposed initialisation method, the test loss of the decentralised federated learning system can approach that of a centralised system with the same total number of training samples, and that the final outcome, in terms of the best test loss achieved, is fairly robust to different network densities and momentary communication failures, and it can benefit from more frequent communication between the nodes.

5.1 Limitations

In this work, we have not considered an unequal allocation of computation power among the nodes to focus solely on the role of the initialisation and the network. In real-world settings, these are often combined or correlated with network properties such as the degree or other centrality measures, which might affect the efficacy of the decentralised federated learning process. Understanding the combination and interactions of these properties with network features adds another layer of interdependency and complexity to the problem, which most certainly was not addressable without first studying the simpler case presented here. The prospect of extending this work to these more complex settings is interesting to consider.

The federated learning process presented here does not support heterogeneous machine learning architectures between nodes. We expect this to become more prominent with the advances in edge computing and device availability. We also did not consider possible heterogeneities in node-to-node communication patterns, such as burstiness or diurnal pattern, which has been shown to affect the rapidity of other network dynamics like spreading and percolation processes [23, 25, 26].

Additionally, some artificial neural network architectures utilise batch normalisation [43], which seemingly greatly limits (but does not eliminate [33]) the vanishing/exploding gradients issue that necessitates careful parameter initialisation. It is important to note that this choice greatly reduces options in the choice of architecture or risks introducing gradient explosion at initial training steps, making deep networks of arbitrary structure prohibitively difficult to train [44]. Careful parameter initialisation, on the other hand, provides a more generalisable solution.

Our work enables uncoordinated decentralised federated learning that can efficiently train a model using all the data available to all nodes without having the nodes share data directly with a centralised server or with each other. While this enables or streamlines novel use cases, it is important to note that trained machine learning models themselves could be exploited to extract information about the training data [45, 46]. It is therefore important to not view federated learning as a panacea for data privacy issues but to view direct data sharing as the weakest link in data privacy.

Acknowledgments and Disclosure of Funding

We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LUMI, hosted by CSC (Finland) and the LUMI consortium through a EuroHPC Regular Access call. We also acknowledge the computational resources provided by the Aalto Science-IT project.

This research was supported by CHIST-ERA-19-XAI010 SAI projects, FWF (grant No. I 5205). JK acknowledges partial support by ERC grant No. 810115-DYNASNET. MK acknowledges support from the ANR project DATAREDUX (ANR-19-CE46-0008); the SoBigData++ H2020-871042 project; and the National Laboratory for Health Security, Alfréd Rényi Institute, RRF-2.3.1-21-2022-00006.

References

  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html.
  • Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021. doi: 10.1561/2200000083. URL https://doi.org/10.1561/2200000083.
  • Rieke et al. [2020] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3(1):119, 2020. doi: 10.1038/S41746-020-00323-1. URL https://doi.org/10.1038/S41746-020-00323-1.
  • Beltrán et al. [2023] Enrique Tomás Martínez Beltrán, Mario Quiles Pérez, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, and Alberto Huertas Celdrán. Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges. IEEE Commun. Surv. Tutorials, 25(4):2983–3013, 2023. doi: 10.1109/COMST.2023.3315746. URL https://doi.org/10.1109/COMST.2023.3315746.
  • Lalitha et al. [2018] Anusha Lalitha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. Fully decentralized federated learning. In Third workshop on bayesian deep learning (NeurIPS), volume 2, 2018. URL http://bayesiandeeplearning.org/2018/papers/140.pdf.
  • Valerio et al. [2023] Lorenzo Valerio, Chiara Boldrini, Andrea Passarella, János Kertész, Márton Karsai, and Gerardo Iñiguez. Coordination-free decentralised federated learning on complex networks: Overcoming heterogeneity. arXiv preprint 2312.04504, 2023. URL https://arxiv.longhoe.net/abs/2312.04504.
  • Newman [2010] Mark E. J. Newman. Networks: An introduction. Oxford UP, 2010. doi: 10.1093/acprof:oso/9780199206650.001.0001.
  • Roy et al. [2019] Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. Braintorrent: A peer-to-peer environment for decentralized federated learning. arXiv preprint 1905.06731, 2019. URL https://arxiv.longhoe.net/abs/1905.06731.
  • Tedeschini et al. [2022] Bernardo Camajori Tedeschini, Stefano Savazzi, Roman Stoklasa, Luca Barbieri, Ioannis Stathopoulos, Monica Nicoli, and Luigi Serio. Decentralized federated learning for healthcare networks: A case study on tumor segmentation. IEEE Access, 10:8693–8708, 2022. doi: 10.1109/ACCESS.2022.3141913. URL https://doi.org/10.1109/ACCESS.2022.3141913.
  • Savazzi et al. [2021] Stefano Savazzi, Monica Nicoli, Mehdi Bennis, Sanaz Kianoush, and Luca Barbieri. Opportunities of federated learning in connected, cooperative, and automated industrial systems. IEEE Communications Magazine, 59(2):16–21, 2021. doi: 10.1109/MCOM.001.2000200. URL https://doi.org/10.1109/10.1109/MCOM.001.2000200.
  • Qu et al. [2020] Youyang Qu, Shiva Raj Pokhrel, Sahil Garg, Longxiang Gao, and Yong Xiang. A blockchained federated learning framework for cognitive computing in industry 4.0 networks. IEEE Transactions on Industrial Informatics, 17(4):2964–2973, 2020. doi: 10.1109/TII.2020.3007817. URL https://doi.org/10.1109/TII.2020.3007817.
  • Sun et al. [2022] Tao Sun, Dongsheng Li, and Bao Wang. Decentralized federated averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4289–4301, 2022. doi: 10.1109/TPAMI.2022.3196503. URL https://doi.org/10.1109/TPAMI.2022.3196503.
  • Lian et al. [2017] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/f75526659f31040afeb61cb7133e4e6d-Paper.pdf.
  • Koloskova et al. [2020] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020. URL https://proceedings.mlr.press/v119/koloskova20a.html.
  • Yuan et al. [2024] Liangqi Yuan, Ziran Wang, Lichao Sun, Philip S. Yu, and Christopher G. Brinton. Decentralized federated learning: A survey and perspective. 2024. URL https://arxiv.longhoe.net/pdf/2306.01603.
  • Jennings [1937] Helen Jennings. Structure of leadership-development and sphere of influence. Sociometry, 1(1/2):99–143, 1937. doi: 0.2307/2785262. URL https://doi.org/0.2307/2785262.
  • Albert et al. [2000] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Error and attack tolerance of complex networks. nature, 406(6794):378–382, 2000. doi: 10.1038/35019019. URL https://doi.org/10.1038/35019019.
  • Watts and Strogatz [1998] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. nature, 393(6684):440–442, 1998. doi: 10.1038/30918. URL https://doi.org/10.1038/30918.
  • Luce and Perry [1949] R Duncan Luce and Albert D Perry. A method of matrix analysis of group structure. Psychometrika, 14(2):95–116, 1949. doi: 10.1007/BF02289146. URL https://doi.org/10.1007/BF02289146.
  • Rice [1927] Stuart A Rice. The identification of blocs in small political bodies. American Political Science Review, 21(3):619–627, 1927. doi: 10.2307/1945514. URL https://doi.org/10.2307/1945514.
  • Fortunato and Hric [2016] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics reports, 659:1–44, 2016. doi: 10.1016/j.physrep.2016.09.002. URL https://doi.org/10.1016/j.physrep.2016.09.002.
  • Orsini et al. [2015] Chiara Orsini, Marija M Dankulov, Pol Colomer-de Simón, Almerima Jamakovic, Priya Mahadevan, Amin Vahdat, Kevin E Bassler, Zoltán Toroczkai, Marián Boguná, Guido Caldarelli, et al. Quantifying randomness in real networks. Nature communications, 6(1):8627, 2015. doi: 10.1038/ncomms9627. URL https://doi.org/10.1038/ncomms9627.
  • Karsai et al. [2011] Márton Karsai, Mikko Kivelä, Raj Kumar Pan, Kimmo Kaski, János Kertész, A-L Barabási, and Jari Saramäki. Small but slow world: How network topology and burstiness slow down spreading. Physical Review E, 83(2):025102, 2011. doi: 10.1103/PhysRevE.83.025102. URL https://doi.org/10.1103/PhysRevE.83.025102.
  • Gauvin et al. [2022] Laetitia Gauvin, Mathieu Génois, Márton Karsai, Mikko Kivelä, Taro Takaguchi, Eugenio Valdano, and Christian L Vestergaard. Randomized reference models for temporal networks. SIAM Review, 64(4):763–830, 2022. doi: 10.1137/19M1242252. URL https://doi.org/10.1137/19M1242252.
  • Badie-Modiri et al. [2022a] Arash Badie-Modiri, Abbas K Rizi, Márton Karsai, and Mikko Kivelä. Directed percolation in random temporal network models with heterogeneities. Physical Review E, 105(5):054313, 2022a. doi: 10.1103/PhysRevE.105.054313. URL https://doi.org/10.1103/PhysRevE.105.054313.
  • Badie-Modiri et al. [2022b] Arash Badie-Modiri, Abbas K Rizi, Márton Karsai, and Mikko Kivelä. Directed percolation in temporal networks. Physical Review Research, 4(2):L022047, 2022b. doi: 10.1103/PhysRevResearch.4.L022047. URL https://doi.org/10.1103/PhysRevResearch.4.L022047.
  • Vogels et al. [2022] Thijs Vogels, Hadrien Hendrikx, and Martin Jaggi. Beyond spectral gap: The role of the topology in decentralized learning. In Advances in Neural Information Processing Systems, volume 35, pages 15039–15050, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/61162d94822d468ee6e92803340f2040-Paper-Conference.pdf.
  • Palmieri et al. [2023] Luigi Palmieri, Lorenzo Valerio, Chiara Boldrini, and Andrea Passarella. The effect of network topologies on fully decentralized learning: a preliminary investigation. In Proceedings of the 1st International Workshop on Networked AI Systems, pages 1–6, 2023. doi: 10.1145/3597062.3597280. URL https://doi.org/10.1145/3597062.3597280.
  • Nguyen et al. [2022] John Nguyen, Kshitiz Malik, Maziar Sanjabi, and Michael Rabbat. Where to begin? exploring the impact of pre-training and initialization in federated learning. arXiv preprint 2206.15387, 2022. URL https://arxiv.longhoe.net/abs/2206.15387.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016. doi: 10.1038/nature14539. URL https://doi.org/10.1038/nature14539.
  • LeCun et al. [2002] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 2002. doi: 10.1007/978-3-642-35289-8_3. URL https://doi.org/10.1007/978-3-642-35289-8_3.
  • Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010. URL https://proceedings.mlr.press/v9/glorot10a.html.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015. doi: 10.1109/ICCV.2015.123. URL https://doi.org/10.1109/ICCV.2015.123.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
  • Newman [2005] Mark EJ Newman. Power laws, pareto distributions and zipf’s law. Contemporary physics, 46(5):323–351, 2005. doi: 10.1080/00107510500052444.
  • Peres and Sousi [2015] Yuval Peres and Perla Sousi. Mixing times are hitting times of large sets. Journal of Theoretical Probability, 28(2):488–519, 2015. doi: 10.1007/s10959-013-0497-9. URL https://doi.org/10.1007/s10959-013-0497-9.
  • Levin and Peres [2017] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • Barzdin [1993] Ya M Barzdin. On the realization of networks in three-dimensional space. Selected Works of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, pages 194–202, 1993. doi: 10.1007/978-94-017-2973-4_11. URL https://doi.org/10.1007/978-94-017-2973-4_11.
  • Pinsker [1973] Mark S Pinsker. On the complexity of a concentrator. In 7th International Telegraffic Conference, volume 4, pages 1–318. Citeseer, 1973. URL https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=71c9fd11ff75889aaa903b027af3a06e750e8add.
  • Fountoulakis and Reed [2008] Nikolaos Fountoulakis and Bruce A Reed. The evolution of the mixing rate of a simple random walk on the giant component of a random graph. Random Structures & Algorithms, 33(1):68–86, 2008. doi: 10.1002/rsa.20210. URL https://doi.org/10.1002/rsa.20210.
  • Benjamini et al. [2014] Itai Benjamini, Gady Kozma, and Nicholas Wormald. The mixing time of the giant component of a random graph. Random Structures & Algorithms, 45(3):383–407, 2014. doi: 10.1002/rsa.20539. URL https://doi.org/10.1002/rsa.20539.
  • Mufa [1996] Chen Mufa. Estimation of spectral gap for markov chains. Acta Mathematica Sinica, 12(4):337–360, 1996. doi: 10.1214/19-EJS1563.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015. doi: 10.5555/3045118.3045167.
  • Yang et al. [2019] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. arXiv:1902.08129, 2019.
  • Carlini et al. [2021] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021. URL https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting.
  • Carlini et al. [2023] Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023. URL https://www.usenix.org/conference/usenixsecurity23/presentation/carlini.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. URL https://doi.org/10.1109/5.726791.
  • Zhu et al. [2020] ** Qiu, Yilei Shi, Jian Kang, Lichao Mou, Hossein Bagheri, Matthias Haberle, Yuansheng Hua, Rong Huang, et al. So2sat lcz42: A benchmark data set for the classification of global local climate zones [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 8(3):76–89, 2020. doi: 10.1109/MGRS.2020.2964708.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Henighan et al. [2020] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint 2010.14701, 2020. URL https://arxiv.longhoe.net/abs/2010.14701.

Appendix A Datasets, implementation and experimental architecture

The artificial neural network architectures employed in this manuscript are a simple feedforward neural network with four fully connected layers, consisting of 512, 256, and 128 neurons in three hidden layers, followed by an output layer of size 10, and employs ReLU activation functions after each layer except the output layer, a simple feedforward neural network consisting of 3 2D convolutional layers with 32, 64 and 64 output channels, each with 3 kernels and one pixel padding of zeros, followed by two fully connected linear hidden layers of size 128 and 64 and one output layer and the VGG16 architecture [47]. Of course, as discussed in the literature, the effects of the initialisation method would be even more visible in deeper neural network architectures [33, 32].

Our experiments will be performed on subsets of the MNIST digit classification task [48], the So2Sat LCZ42 dataset for local climate zone classification [49] and the CIFAR-10 image classification dataset [50], distributed between nodes either iid or non-iid based on a Zipf distribution. In terms of local optimisation, we tested stochastic gradient decent with momentum and Adam with decoupled weight decay, although empirical evidence (Figure 3) hints that the the effects of local optimisation are only non-negligible compared to those of the aggregation at a longer time-scale than the one mainly of interest in this manuscript.

The MNIST dataset was released under the Creative Commons 3.0 Attribution-ShareAlike license. Available at http://yann.lecun.com/exdb/mnist/.

The So2Sat LCZ42 dataset was released under the Creative Commons 4.0 Attribution licence, at https://mediatum.ub.tum.de/1613658. In our use case, we used a random subset of the random split of the third version, only including the 10 bands from the Sentinel-2 satellite to artificially simulate a more realistic, data-poor scenario.

The CIFAR-10 (Canadian Institute for Advanced Research, 10 classes) dataset was released under the Creative Commons 0 version 1.0, available at https://www.cs.toronto.edu/~kriz/cifar.html.

Table 1 shows in brief the infromation about the configurations used for producing figures in the manuscript.

Table 1: Configurations used in this manuscript. Stochastic gradient descent used momentum m=0.5𝑚0.5m=0.5italic_m = 0.5, while Adam optimiser with decoupled weight decay was initialised with parameter β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9, β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999, ϵ=108italic-ϵsuperscript108\epsilon=10^{-8}italic_ϵ = 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT and λ=102𝜆superscript102\lambda=10^{-2}italic_λ = 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT. Both Optimisers used a learning rate of 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. All configurations used a minibatch size of 16 and 8 minibatches of local training per communication round.
Cfg. Dataset Architecture Comm. net. Optimiser Data dist. Items per node
A MNIST MLP Full SGD iid 512
B So2Sat CNN+MLP BA (m=8) SGD Zipf (α=1.8𝛼1.8\alpha=1.8italic_α = 1.8) 1024
C CIFAR-10 VGG-16 4-regular SGD iid 512
D MNIST MLP Full AdamW iid 512

Runtimes for and configuration used is reported in Table 2 for the purpose of reproduction.

The implementation of the full-fidelity simulated decentralised federated learning system is available for the purposes of reproduction under the MIT open-source license at https://anonymous.4open.science/r/sat-9D08.

Table 2: Median runtime (in minutes) for a single realisation of each configuration, and the total runtime of all realisations for each configuration. Scaling refers to Figure 1, Estimates to Figure 4, and Probs to Figure 2. Runtimes were measured on AMD MI250X GPUs with two realisations running concurrently per graphics compute die (GCD) to increase efficiency. Configuration labels refer to Table 1. Note that if the preliminary experiments and the experiments presented solely in the appendices are taken into account, the total computation costs increases to roughly 10000–12000 GCD-hours.
Configuration Size (nodes) Median runtime (mins) Total runtime (hours)
A (Scaling) 8 9.3 10.1
16 20.1 22.4
32 51.2 55.1
64 137.1 148.3
A (Estimates) 64 136.2 642.4
A (Probs) 64 77.4 1466
B (Scaling) 32 95.0 405.2
64 177.7 757.7
128 340.2 1499.4
256 716.8 1530.1
C (Scaling) 8 47.5 25.7
16 92.9 49.7
32 167.1 90.0
64 314.9 169.87
D (Scaling) 8 10.54 5.5
16 23.05 12.0
32 51.42 27.3
64 126.8 68.2
Total runtime 6985.0
Total GCD-hours 3492.5

Appendix B Scalability and the role of exogenous and endogenous decentralised federated learning parameters

As shown before in Figure 1, the choice of initialisation strategy significantly affects the behaviour of the system when varying the environment parameter such as the number of nodes. In this section, we will briefly discuss the effect of network topology on the learning trajectory of the system, then systematically analyse the role of different environmental parameters such as the system size (number of nodes), the communication network density, the training sample size and the frequency of communication between nodes in the trajectory of the decentralised federated learning, when using the initialisation method proposed in Section 4. As most of these quantities are involved in some form of cost–benefit trade-off, understanding the changes in behaviour due to each one can allow a better grasp of the system behaviour at larger scales.

For the rest of this section, however, we limited the analysis to a single topology, random k𝑘kitalic_k-regular networks, to focus on a more in-depth analysis of the role of environmental parameters other than the network topology, such as the system size, frequency of communication, and network density.

For the purposes of this section, we make the simplifying assumption that the communication time is negligible compared to the training time. In some cases, we introduce “wall-clock equivalent” values, indicating the computation time spent by an individual node up to communication round t𝑡titalic_t, multiplied by the number of training mini-batches of training between two rounds of communication. This “wall-clock equivalent” can be seen as a linear scaling of the communication rounds t𝑡titalic_t.

Refer to caption
Figure 6: Trajectory of mean test cross-entropy loss over communication rounds for (a) connected random k𝑘kitalic_k-regular networks with n=64𝑛64n=64italic_n = 64 nodes and different values for degree k𝑘kitalic_k, with 80 balanced training samples per node, (b) 32-regular random network with different number of total labelled training samples, balanced across classes, assigned to each item, (c) with different number of nodes and (d) with different number of local epochs between communications. In all panels, the horizontal dashed lines correspond to the best test loss of a central system with the same amount of the total number of training samples as the entire decentralised federated learning system simulated. Error bars represent 95% confidence intervals. The horizontal axes in (b,d) are scaled to show the “wall-clock equivalent”, a value linearly comparable to the total computation cost of a single node until round t𝑡titalic_t.

Network density.

The number of links in the communication network directly increases the communication burden on the nodes. Our results (Figure 6(a) show that while a very small value for the average degree affects the rapidity of the training convergence disproportionately, as long as the average degree is significantly larger than the critical threshold for connectivity, i.e., for random network models with average excess degree q(k)1much-greater-thandelimited-⟨⟩𝑞𝑘1\langle q(k)\rangle\gg 1⟨ italic_q ( italic_k ) ⟩ ≫ 1, the trajectory will be quite consistent across different network densities. Note that, although in Section 4.3 we were mostly concerned with the scaling of the initial mixing time with the number of nodes, in many cases this would also benefit from a higher average degree. Also note that average degrees close to the critical threshold might not prove practical or desirable for the communication network in the first place, as the network close to the critical threshold is highly susceptible to fragmentation with the cutting off of even very few links.

Training samples per node.

Assuming that each device is capable of performing training on a constant rate of mini-batches per unit of time, more training samples per node increase the total amount of training data, while also linearly increasing the training time for every epoch. Our results (Figure 6(b)) show that (1) the test loss approaches that of a centralised system with the same number of total training samples, and (2) that the trajectory of test loss with effective wall-clock time remains consistent.

System size and total computation cost.

The number of nodes in the network affects the training process in multiple ways. If a larger system size is synonymous with a proportionate increase in the total number of training samples available to the system as a whole, it is interesting to see if the system is capable of utilising those in the same way as an increase in the number of items per node would. Our results (Figure 6(c)) show that if the increase in size coincides with an increase in the total number of items, the system is able to effectively utilise these, always approaching the test loss limit of a centralised system with the same total data.

Another aspect is that an increase in the number of nodes would mean an increase in the total computation cost, so it would be interesting to analyse if this increase (without a corresponding increase in the total amount of data) would result in any improvements in the learning trajectory. In short, our results in Figure 7 show that if the same amount of data is spread across more nodes, each node will have to train on roughly the same number of minibatches to arrive at a similar test loss, and that this result is even consistent with the learning trajectory of the centralised single node scenario.

Refer to caption
Figure 7: (a) Trajectory of mean test cross-entropy loss over wall-clock time equivalent over 32-regular random graphs and for an isolated node while kee** the total number of training samples across the whole system constant. Each node was assigned training samples balanced across 10 classes, with a total of 40 960 training samples divided equally across the nodes. Error bars represent 95% confidence intervals. The horizontal dashed line corresponds to the best test loss of a central system with the same total amount of training samples as the entire decentralised federated learning system simulated. The sloped dashed line shows the power-law trajectory of loss with equivalent to wall-clock time consistent with results from Henighan et al. [51]. (b) “Slowdown” as a result of splitting the same total amount of training samples across more nodes. Values are calculated as τ1(0.11)/τn(0.11)subscript𝜏10.11subscript𝜏𝑛0.11\tau_{1}(0.11)/\tau_{n}(0.11)italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( 0.11 ) / italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( 0.11 ) where τn(l)subscript𝜏𝑛𝑙\tau_{n}(l)italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_l ) is the wall-clock time equivalent (values of horizontal axis from panel (a)) required for a system with n𝑛nitalic_n nodes to reach cross-entropy loss value l𝑙litalic_l.

Communication frequency.

Finally, we consider the role of frequency of communication in the trajectory of loss, manifested as the number of local training epochs between communications. It has been shown in the context of decentralised parallel stochastic gradient descent that a higher frequency of communications increases the efficacy of the training process, as it prevents a larger drift [13]. While a similar phenomenon in the context of an uncoordinated decentralised federated learning seems plausible, showing this relationship empirically on a system of reasonable size was fraught with difficulties due to the issues discussed in Figure 1. Utilising the proposed initialisation method enables this and allows us to confirm (Figure 6(d)) that while more frequent communication increases the communication burden on the entire network, more frequent communication translates to both a lower final test loss as well as faster convergence.