On the Trade-off between Flatness and Optimization
in Distributed Learning
Abstract
This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work establishes two results. First, it shows that decentralized learning strategies are able to escape faster from local minimizers and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.
Index Terms:
Flatness, escaping efficiency, distributed learning, decentralized learning, nonconvex learning.
I Introduction and related works
As modern society continues to evolve into a more interconnected world, and large-scale applications become increasingly prevalent, the interest in distributed learning has grown significantly [1, 2, 3]. In the distributed scenario, a group of agents, each with its own objective function, collaborate to optimize a global objective. To achieve the goal, two popular gradient descent based methodologies are commonly employed in algorithm design [4, 5, 6, 7]. The first one, known as the centralized method, requires all agents to send their data to a central server that manages all computations. In the second approach, which is called decentralized, agents are connected by a graph topology, and they process data locally and exchange information with their neighbors. Furthermore, the terms consensus and diffusion refer to two prominent decentralized methods in the optimization and machine learning communities [8, 9, 10, 1, 11, 12, 13, 14]. It is widely known that decentralized implementations offer enhanced data privacy, resilience to failure and reduced computation burden compared to the centralized approach [6, 9, 15, 16, 17, 18, 19].
![Refer to caption](extracted/5698508/cifar100/test_comp_ring_512_error_small.png)
As for the fundamental properties of the three strategies including centralized, consensus and diffusion, the works by [6, 20, 11, 21, 15, 22, 23] provide convergence guarantees for convex and non-convex environments. The results generally suggest that decentralized operations could, at best, only match the optimization performance of the centralized approach [6, 24, 25], or potentially degrade the generalization ability of models [26, 10, 27]. Nevertheless, the work [28] empirically demonstrated that the network disagreement, i.e., the distance among agents, in the middle of decentralized training can enhance generalization over centralized training. Afterwards, the work [9] provably verified how the consensus strategy enhances the generalization performance of models over centralized solutions in the large-batch setting from the perspective of sharpness-aware minimization (SAM), which is primarily designed to reduce the sharpness of machine learning models [29]. Specifically, the results in [9] showed that the network disagreement among agents, which changes over time, during training with the consensus method implicitly introduces an extra regularization term related to SAM to the original risk function, which helps drive the iterates to flatter models that are known to generalize better. However, as Figure 1 shows, flatter models obtained by consensus do not always guarantee better test accuracy in the context of distributed learning. This is because the final test accuracy depends on both generalization and optimization performance. Although decentralized training tends to favor flatter minima that generalize better than those found by the centralized method, it could also implicitly degrade the optimization performance in some cases. This observation motivates us to examine the behavior of centralized and decentralized strategies more closely and to compare more directly both their generalization and optimization abilities. 
As a result, this work establishes two results. First, it shows that decentralized learning strategies are able to escape faster from local minimizers and favor convergence toward flatter minima relative to the centralized solution. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.
Notation. In this paper, we use boldface letters to denote random variables. For a scalar-valued function where is a scalar, we say that if for some constant . We say if as . For any matrix , signifies that the magnitude of the individual entries of X are or the norm of is .
I-A Related works
Understanding the optimization behavior of mini-batch gradient descent (GD) algorithms and their influence on the generalization of models has emerged as a prominent topic of interest in recent years. Large-batch training is increasingly important in distributed learning for its potential to enhance the training speed and scalability of modern neural networks. However, it has already been observed in the literature that small-batch GD tends to converge to flatter minima than the large-batch version [30], and that flat minima usually generalize better than sharp ones [31, 32, 33, 34, 35, 36]. Intuitively, the loss values around flat minima change slowly when the model parameters are adjusted, thereby reducing the disparity between the training and test data [30]. The works by [37, 38, 39, 40, 34, 41] utilized the stochastic differential equation (SDE) approach as a fundamental tool to analyze the regularization effects of mini-batch GD. This type of analysis inherently introduces an extra layer of approximation error because the discrete-time update is substituted by a different equation. Moreover, to enable the analysis with SDE, it is necessary to make some extra assumptions on the gradient noise, such as being Gaussian distributed [34, 39], parameter-independent [37] or heavy-tailed distributed [42]. To overcome these limitations, another line of work [43, 44, 33] focused on the dynamical stability of mini-batch GD with discrete-time analysis. Specifically, they determined the conditions on the Hessian matrices to ensure that the distance to a local minimum does not increase, thereby stabilizing the iterates in the vicinity of a local minimum.
I-B Contribution
Contrary to the aforementioned works focusing on the single-agent case, we will study in this work the optimization bias of distributed algorithms in the multi-agent setting, which allows us to compare the performance of various distributed strategies. Inspired by [34], we start by investigating the escaping efficiency of algorithms from local minima, and then relate it to the trade-off between flatness and optimization. Our contributions are listed as follows:
(1) We propose a general framework to examine the escaping efficiency of distributed algorithms from local minima by staying in the discrete-time domain. Since the dependence of the Hessian matrix on the immediate iterate in the original recursion makes the optimization analysis intractable, it is necessary to resort to a short-term model where the original Hessian is replaced by one evaluated at some local minimum. We rigorously verify that the approximation error between the short-term and true models is negligible, ensuring that the short-term model represents the true model accurately enough. Afterwards, we obtain closed-form expressions of the short-term excess risk, which quantifies the escaping efficiency. Note that we follow a different discrete-time analysis from the dynamical stability approach used in [43, 44, 33], which focused on studying the stability of algorithms. Similar to [34], our emphasis is on quantifying the extent to which algorithms can escape from local minima.
(2) According to the results obtained from (1), we compare the escaping efficiency of the centralized, consensus and diffusion methods. We show that decentralized approaches, i.e., consensus and diffusion, gain additional efficacy from the network heterogeneity and graph structure, enabling them to escape faster from local minima than the centralized strategy. We further verify that consensus outperforms diffusion in terms of escaping ability from local minima. We also show that higher escaping efficiency encourages algorithms to favor flatter minima by relating escaping efficiency to the flatness metric.
(3) If the additional power generated by the network heterogeneity and graph structure is not sufficiently strong to allow decentralized methods to successfully leave the current basin, then both decentralized and centralized methods will be stuck within the basin of the current local minimum. A similar rule also holds for the comparison between consensus and diffusion. This motivates us to pursue next a long-term analysis, which corresponds to the optimization performance. In this context, we verify that the extra power boosting the escaping efficiency could in turn deteriorate the optimization performance. This reveals the inherent trade-off between flatness and optimization, with diffusion showing superior optimization performance over consensus in terms of smaller estimation error.
(4) We finally illustrate the performance of centralized, consensus and diffusion training strategies on real data. In addition to the flatness and optimization performance of the three distributed methods, we show that diffusion achieves a favorable balance between flatness and optimization, thereby exhibiting higher test accuracy than the other two methods (i.e., centralized and consensus).
II Problem Statement
II-A Empirical risk minimization
Consider a collection of agents (or nodes) linked by a graph topology. Each agent receives a streaming sequence of data realizations arising from independently distributed observations. The agents wish to collaborate to solve a distributed learning problem of the following form:
(1) |
where is the dimension of the unknown vector , is the risk function for the -th agent, and is the aggregate risk across the graph. Each individual risk is defined as the empirical average of a loss function over a collection of training data points, namely,
(2) |
where is the possibly nonconvex loss function and the refers to training samples with feature vector and true label arising as realizations from a random source associated with agent .
Since we aim to understand the optimization behavior of algorithms around local minima of nonconvex risk functions, we need to distinguish between the local minima of and . Thus, we let denote a local minimizer for the aggregate risk , and let denote a local minimizer for the individual risk . If all agents in the network share the same risk function, i.e., , then , and we refer to the network as being homogeneous. In this case, all local minimizers of are also local minima of . Otherwise, we refer to the network as being heterogeneous, where the of all agents can be distinct among themselves and also in relation to . In this paper, we focus on this latter more general case.
II-B Gradient-descent algorithms
We consider three popular classes of algorithms that can be used to seek a solution for (1). The first algorithm is the mini-batch centralized method, in which all data from across the graph are shared with a central processor. At every instant , samples denoted by are selected uniformly at random with replacement from the training data available for each agent . Starting from a random initial condition , the central processor updates the estimate by using:
(3) |
Here, is the batch size and is a small positive step-size parameter. Observe that we are using boldface letters to refer to the data samples and iterates in (3); it is our convention in this paper to use the boldface notation to refer to random quantities.
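To make the centralized mini-batch update concrete, the following minimal sketch runs it on a toy problem. It is an illustration under stated assumptions rather than a transcription of (3): the scalar model, the squared loss in `loss_grad`, the data in `AGENT_DATA`, and the choice to average the mini-batch gradients uniformly over agents are all hypothetical.

```python
import random

def loss_grad(w, sample):
    """Gradient of the hypothetical loss Q(w; x, y) = 0.5 * (x*w - y)**2."""
    x, y = sample
    return x * (x * w - y)

def centralized_step(w, agent_data, batch_size, step_size, rng):
    """One centralized update (assumed form): every agent contributes a
    mini-batch drawn uniformly at random with replacement, and the central
    processor averages the resulting mini-batch gradients over the agents."""
    total = 0.0
    for data in agent_data:
        batch = [rng.choice(data) for _ in range(batch_size)]
        total += sum(loss_grad(w, s) for s in batch) / batch_size
    return w - step_size * total / len(agent_data)

def run_centralized(agent_data, n_iters=500, batch_size=4, step_size=0.05, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)          # random initial condition
    for _ in range(n_iters):
        w = centralized_step(w, agent_data, batch_size, step_size, rng)
    return w

# Two agents holding noiseless data generated by y = 2 * x (illustrative).
AGENT_DATA = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0), (1.5, 3.0)],
]
W_FINAL = run_centralized(AGENT_DATA)
```

Because the toy data are noiseless, every sampled gradient vanishes at the same point, so the iterate settles at that common minimizer regardless of which samples are drawn.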
The other two algorithms are of the decentralized type, where the data remains local and agents interact locally with their neighbors to solve (1) through a collaborative process. We consider two types of decentralized methods, which have been studied extensively in the literature, namely, the consensus and diffusion strategies [4, 5, 6, 7, 20, 24].
Before listing the algorithms, we describe the graph structure that drives their operation. The agents are assumed to be linked by a weighted graph topology. The weight on the link from agent to agent is denoted by ; this value is used to scale information sent from to . Each is non-negative and lies within ; it will be strictly positive if there exists a link from to over which information can be shared. We collect the into a matrix . In this paper, we consider a symmetric doubly-stochastic matrix : the entries on each column of are normalized to add up to .
Assumption II.1.
(Strongly-connected graph). The graph is assumed to be strongly connected. This means that there exists a path with nonzero weights linking any pair of agents and, in addition, at least one node in the network has a self-loop with .
It follows from the Perron-Frobenius theorem [6] that has a single eigenvalue at 1. Moreover, if we let denote the corresponding right eigenvector, then all its entries are positive and they can be normalized to add up to :
(4) |
For the doubly-stochastic matrix , we have . This means that the entries of the Perron vector are identical.
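As a small numerical check of the statement that a doubly-stochastic combination matrix has a Perron vector with identical entries, the sketch below builds one possible symmetric doubly-stochastic matrix for a ring of five agents and runs a power iteration. The ring topology, the weight values, and the names `ring_combination_matrix` and `perron_vector` are assumptions made for illustration only.

```python
def ring_combination_matrix(num_agents, self_weight=0.5):
    """Symmetric doubly-stochastic weights for a ring: each agent keeps
    `self_weight` and splits the remainder equally between its two neighbors."""
    A = [[0.0] * num_agents for _ in range(num_agents)]
    side = (1.0 - self_weight) / 2.0
    for k in range(num_agents):
        A[k][k] = self_weight
        A[k][(k - 1) % num_agents] += side
        A[k][(k + 1) % num_agents] += side
    return A

def is_doubly_stochastic(A, tol=1e-12):
    """Check non-negativity and that every row and column adds up to one."""
    n = len(A)
    rows_ok = all(abs(sum(A[i]) - 1.0) < tol for i in range(n))
    cols_ok = all(abs(sum(A[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    nonneg = all(a >= 0.0 for row in A for a in row)
    return rows_ok and cols_ok and nonneg

def perron_vector(A, n_iters=2000):
    """Power iteration for the right eigenvector of A at eigenvalue 1,
    normalized so that its entries add up to one."""
    n = len(A)
    p = [1.0 if i == 0 else 0.0 for i in range(n)]
    for _ in range(n_iters):
        p = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        s = sum(p)
        p = [v / s for v in p]
    return p

A = ring_combination_matrix(5)
P = perron_vector(A)   # every entry should be close to 1/5
```

The positive self-weights play the role of the self-loops required by the strong-connectivity assumption, which is what lets the power iteration converge.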
The diffusion strategy consists of the following two steps:
(5a) | ||||
(5b) |
At every iteration , every agent samples data points and uses to update its iterate to the intermediate value . Subsequently, the same agent combines the intermediate iterates from across its neighbors using (5b). The symbol denotes the collection of neighbors of . It is useful to remark that the centralized implementation (3) can be viewed as a special case of (5a)–(5b) if we select the combination matrix as .
In comparison, the consensus strategy involves the following steps:
(6a) | ||||
(6b) |
In this case, the existing iterates are first combined to generate the intermediate value , after which (6b) is applied. Observe the asymmetry on the right-hand side in (6b). The starting iterate is , while the loss functions are evaluated at the different iterates . In contrast, the same iterate appears in both terms on the RHS of (5b). This symmetry has been shown to enlarge the stability range of diffusion implementations over their consensus counterparts in the convex case, namely, diffusion is mean-square stable for a wider range of step-sizes [45, 1, 7].
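The difference in the order of the adaptation and combination steps can be sketched as follows. This is a noise-free illustration under assumptions: each agent uses a hypothetical scalar quadratic risk with its own minimizer (`TARGETS`), full gradients replace mini-batch gradients, and the 3-agent matrix `A` is an arbitrary symmetric doubly-stochastic choice.

```python
def grad(w, target):
    """Gradient of the hypothetical local risk J_k(w) = 0.5 * (w - target)**2."""
    return w - target

def combine(A, values):
    """Agent k forms the weighted sum over l of A[l][k] * values[l]."""
    n = len(values)
    return [sum(A[l][k] * values[l] for l in range(n)) for k in range(n)]

def diffusion_step(A, w, targets, mu):
    # Adapt first, then combine the intermediate iterates of the neighbors.
    phi = [w[k] - mu * grad(w[k], targets[k]) for k in range(len(w))]
    return combine(A, phi)

def consensus_step(A, w, targets, mu):
    # Combine the existing iterates first; the gradient is still evaluated
    # at the previous iterate w[k], not at the combined value phi[k].
    phi = combine(A, w)
    return [phi[k] - mu * grad(w[k], targets[k]) for k in range(len(w))]

def run(step, A, targets, mu=0.05, n_iters=2000):
    w = [0.0] * len(targets)
    for _ in range(n_iters):
        w = step(A, w, targets, mu)
    return w

A = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
TARGETS = [0.0, 1.0, 5.0]      # heterogeneous local minimizers (illustrative)
W_DIFFUSION = run(diffusion_step, A, TARGETS)
W_CONSENSUS = run(consensus_step, A, TARGETS)
```

In this toy run both recursions keep the network average at the minimizer of the aggregate risk, while the consensus iterates stay more spread out across agents than the diffusion iterates. This is an observation about this particular example, not a general statement.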
II-C Escaping efficiency from local minima
We first characterize the basin (or valley) of a local minimum:
Definition II.2.
(Basin of attraction [38].) For a given local minimizer , its basin of attraction (or the valley of ) is defined as the set of all points starting from which or as if the step size is sufficiently small and there is no noise in the gradient-descent algorithms.
We illustrate the basin of attraction related to in Figure 2, where is a bounded open set. Also, we use to denote the boundary of , according to which in Figure 2. If or , then we say the algorithm escapes from exactly.
![Refer to caption](extracted/5698508/conceptual.png)
To analyze the escaping behavior of learning algorithms from local minima, it is necessary to quantify their escaping ability. Inspired by the continuous-time definition of escaping efficiency for the single-agent case from [34], and by the metrics used to assess the optimization performance of stochastic learning algorithms [6, 7], we use the following metric to measure the discrete-time escaping efficiency for general non-convex environments.
Definition II.3.
(Escaping efficiency). Assuming all agents start from points close to a local minimizer , the escaping efficiency over the network at iteration is defined by:
(7) |
The larger this value is, the higher the escaping efficiency will be away from . If the algorithm ultimately converges to , then the larger is, the worse the optimization performance will be.
assesses the average excess-risk value across the network, and it quantifies the deviation between the current model at iteration and the local minimum. For a fixed , a larger value of indicates an expected faster escape from the local minimum (since the iterate will be further away from it). To justify whether an algorithm escapes from a local basin or not, earlier studies [46, 47, 48] have established various criteria based on the analysis methods they use. They nevertheless generally adhered to the principle of the minimum effort to reach the boundary of the local basin. In a similar vein, we introduce the risk barrier, which is defined as the infimum of the risk gap between the boundary points and the local minimum:
(8) |
For example, the risk barrier of in Figure 2 is . When exceeds the corresponding risk barrier defined in (8), namely,
(9) |
then we say that the network escapes from on average. If, on the other hand, gets stuck around (or converges to) , then would serve as a measure of the resulting optimization performance [6, 7].
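The escape criterion above can be illustrated on a one-dimensional example. The sketch below is a hedged illustration: the polynomial `risk`, the sample iterates, and the use of a single realization in place of the expectation are assumptions, and the helper names are hypothetical.

```python
import math

def risk(w):
    """Hypothetical 1-D risk with local minima at w = -3, 0 and 2."""
    return (w * (w - 2.0) * (w + 3.0)) ** 2 / 36.0

def excess_risk(iterates, w_star):
    """Average excess risk over the agents' iterates (a single-realization
    stand-in for the expectation used in the definition)."""
    return sum(risk(w) for w in iterates) / len(iterates) - risk(w_star)

def risk_barrier(boundary_points, w_star):
    """Smallest risk gap between the basin boundary and the local minimum."""
    return min(risk(b) for b in boundary_points) - risk(w_star)

W_STAR = 0.0
# The basin of w* = 0 is bounded by the two neighboring local maxima of the
# risk, which are the roots of 3*w**2 + 2*w - 6 = 0.
disc = math.sqrt(4.0 + 72.0)
BOUNDARY = [(-2.0 - disc) / 6.0, (-2.0 + disc) / 6.0]
BARRIER = risk_barrier(BOUNDARY, W_STAR)

ER_NEAR = excess_risk([0.1, -0.2, 0.05], W_STAR)   # agents close to w*
ER_FAR = excess_risk([-1.2, -1.0, 0.9], W_STAR)    # agents high on the basin walls
ESCAPES_NEAR = ER_NEAR > BARRIER
ESCAPES_FAR = ER_FAR > BARRIER
```

Here the barrier is set by the lower of the two basin walls, in line with the minimum-effort principle. Note that the criterion is a statement about the average excess risk across agents, not about the position of any individual iterate.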
II-D Flatness metrics
Next, we formally characterize the notion of flat minima. Consider a local minimum and a small drift around it. The change in the risk value can be approximated by
(10) |
where refers to the (positive definite) Hessian matrix of at . If the above change in the risk value is small, then we refer to as a flat minimum. Otherwise, if the change in the risk value is large, then we refer to as a sharp minimum. Motivated by (10), various metrics related to the Hessian matrix are applied in the literature to measure the flatness of local minima, e.g., the spectral norm [33, 43], the Frobenius norm [33, 44] and the trace [49, 33, 50]. In this paper, we use the trace of the Hessian as a flatness measure, but the other two metrics may also be applied due to the norm equivalence, namely:
(11) |
Note that for highly ill-conditioned minima, which are commonly observed in over-parameterized neural networks [34, 51], the first largest eigenvalues of dominate the remaining eigenvalues where . In this case, the norm equivalence in (11) can be more tightly guaranteed since now the in the upper and lower bounds can be approximately replaced by . More insights related to the flatness measure can be found in [49, 50].
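The three flatness metrics can be compared numerically on small examples. The sketch below uses two hypothetical 2x2 positive-definite Hessians and checks the standard chain Tr(H)/M <= spectral norm <= Frobenius norm <= Tr(H), which holds for any positive semi-definite matrix of dimension M. This chain is a well-known fact and is not claimed to be the exact form of (11).

```python
import math

def flatness_metrics(H):
    """Trace, Frobenius norm and spectral norm of a symmetric 2x2 matrix
    H = [[a, b], [b, c]], using the closed-form eigenvalues."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    mean, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    eigenvalues = (mean + radius, mean - radius)
    trace = a + c
    frobenius = math.sqrt(a * a + 2.0 * b * b + c * c)
    spectral = max(abs(v) for v in eigenvalues)
    return trace, frobenius, spectral

# Hypothetical Hessians at a flat and at a sharp minimum (both positive definite).
H_FLAT = [[0.2, 0.05], [0.05, 0.1]]
H_SHARP = [[8.0, 1.0], [1.0, 3.0]]

TRACE_FLAT, FRO_FLAT, SPEC_FLAT = flatness_metrics(H_FLAT)
TRACE_SHARP, FRO_SHARP, SPEC_SHARP = flatness_metrics(H_SHARP)
```

In this example all three metrics rank the two minima in the same order, which is the practical meaning of using them interchangeably.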
III Escaping efficiency of multi-agent learning
Motivated by the definitions in the last section, we now examine the performance of decentralized and centralized methods.
III-A Modeling conditions
We introduce the following commonly-used assumptions. First, since our purpose is to understand the behavior of the algorithm around the same local minimum , we assume all agents start from points near .
Assumption III.1.
(Starting points of all agents). All models start their updates from initial points that are sufficiently close to , namely,
(12) |
By using Jensen’s inequality on (12), we have
(13) |
Since we will establish that mini-batch gradient descent algorithms approach an neighborhood of , it is reasonable to assume all agents start from the neighborhood, which is dominated by . This assumption is weaker than the one used in [34] where models should exactly start from .
Assumption III.2.
(Smoothness condition.) For each agent , the gradient of relative to is Lipschitz. Specifically, for any , it holds that:
(14) |
Next, consider the Hessian matrix of agent at denoted by:
(15) |
and the global Hessian matrix of at :
(16) |
Assumption III.3.
(Small Hessian disagreement). The local Hessian at is sufficiently close to the global Hessian, namely,
(17) |
with a small constant .
Assumption III.3 can be satisfied when the data heterogeneity among agents is sufficiently small. For example, suppose all agents observe independent and identically distributed data, and consider the stochastic Hessian defined by:
(18) |
Then by resorting to the matrix Bernstein inequality [52], for any , we have:
(19) |
Thus, when each agent collects a sufficient amount of data, all local Hessian matrices are guaranteed to be close to the stochastic Hessian with high probability, from which we get:
(20) |
Intuitively, even when all agents independently collect data from different distributions, if the data distributions among agents are sufficiently close to each other, then Assumption III.3 can still be satisfied.
III-B Properties of gradient noise
To enable the analysis, we describe properties associated with the gradient noise process for later use. For any , the stochastic gradient noise at agent at iteration is defined by the difference:
(21) |
where the term stochastic refers to the case (i.e., batch size equal to 1). In the mini-batch case, the gradient noise is instead given by:
(22) |
We denote the covariance matrix of the gradient noise by:
(23) |
which is symmetric and non-negative definite. Likewise, for the mini-batch case,
(24) |
Lemma III.4.
(Gradient noise terms). Let denote the filtration generated by the past history of the random process for all and . For any , we define the error vector
(25) |
Then, under assumption III.2, it holds that the gradient noise defined in has zero mean:
(26) |
while its second and fourth-order moments are upper bounded by terms related to the batch size as follows:
(27) | ||||
(28) |
where the nonnegative scalars are on the order of
(29) | |||
(30) |
Moreover, the covariance matrices (23) and (24) are scaled versions of each other:
(31) |
Proof.
See Appendix A. ∎
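The scaling relation between the single-sample and mini-batch noise covariances can be verified exactly on a tiny example. The sketch below assumes scalar gradients, sampling with replacement, and three hypothetical per-sample gradient values; it enumerates every ordered mini-batch and uses exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def variance(values):
    """Population variance of equally likely values (exact arithmetic)."""
    n = len(values)
    mean = sum(values, Fraction(0)) / n
    return sum(((v - mean) ** 2 for v in values), Fraction(0)) / n

def minibatch_variance(sample_grads, batch_size):
    """Exact variance of the mini-batch gradient when the batch is drawn
    uniformly at random with replacement: enumerate every ordered batch."""
    batch_means = [sum(batch, Fraction(0)) / batch_size
                   for batch in product(sample_grads, repeat=batch_size)]
    return variance(batch_means)

# Hypothetical per-sample gradients at a fixed model (scalar case).
SAMPLE_GRADS = [Fraction(1), Fraction(2), Fraction(6)]
VAR_1 = minibatch_variance(SAMPLE_GRADS, 1)
VAR_2 = minibatch_variance(SAMPLE_GRADS, 2)
VAR_3 = minibatch_variance(SAMPLE_GRADS, 3)
```

Since the gradient noise is the sampled gradient minus its mean, its variance equals the variance of the sampled gradient, and in this example it shrinks exactly in proportion to the inverse of the batch size.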
We can further relate the noise covariance matrix at and the Hessian matrix . In the single-agent case, when and negative log-likelihood losses are used, it is well-known that there is an exact equivalence between the Hessian and the gradient covariance matrices at local minima in stochastic risk minimization [34], and this relationship can also be approximately used in the context of empirical risk minimization [34, 53, 39]. By log-likelihood losses we mean choosing to be of the form
(32) |
where and represent the feature vector and label associated with . In the multi-agent case, we define the gradient covariance matrix of the global risk function at for the case as:
(33) |
Lemma III.5.
(Relation to Hessian matrix). Under Assumption II.1, if the negative log likelihood losses are used by all agents, it holds that
(34) |
Proof.
See Appendix B. ∎
Lemma III.5 establishes the approximate equivalence between and in the context of empirical risk minimization. Since negative log-likelihood losses, e.g., mean-square loss and cross-entropy loss, are popular in the machine learning community, we mainly focus on this category in this paper, but our framework can also be applied to other losses.
III-C Network performance analysis
We now proceed with the network analysis, which will enable us to derive expressions for the value of . We will then use these expressions to deduce properties about the behavior of the centralized and decentralized algorithms in relation to flat minima and esca** efficiency. To begin with, we follow the decomposition from [6] and note that the combination matrix admits an eigen-decomposition of the form:
(35) |
where the matrices and are:
(36) |
Here, is a diagonal matrix with elements from the second largest-magnitude eigenvalue to the smallest-magnitude eigenvalue of appearing on the diagonal, and . Consider the extended network version policy:
(37) |
where denotes the Kronecker product, and is the identity matrix of size . Then, satisfies
(38) |
where
(39) |
We further collect quantities from across the network into the block variables:
(40) |
where denotes a block column vector, and each Hessian matrix is defined by
(41) |
Using (37)–(III-C), we can rewrite algorithms (3), (5a)–(5b) and (6a)–(6b) using a unified description as follows:
(42) |
where
(43) |
and the choices for the matrices depend on the nature of the algorithm. For instance, for the consensus algorithm we set
(44) |
while for diffusion we set
(45) |
Moreover, the centralized method can be viewed as a special case of diffusion for which
(46) |
Unfortunately, the dependence of on makes the analysis with (42) intractable. To overcome this challenge, and inspired by [11, 21, 6], we introduce the alternative block diagonal matrix
(47) |
where the are evaluated at . Using in place of , we replace recursion (42) by the following so-called short-term model:
(48) |
This approximation naturally raises the following question: How accurately can the short-term model in (48) approximate the true recursion in (42)? This question is answered by the following two lemmas, where we separately show the results for the decentralized and centralized methods over a finite time horizon.
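The idea of freezing the Hessian at the local minimizer can be illustrated in the simplest possible setting. The sketch below is a single-agent, scalar illustration under assumptions: the risk J(w) = 1 - cos(w), the Gaussian gradient noise, and all parameter values are hypothetical, and both recursions are driven by the same noise realization so that their gap isolates the effect of the frozen curvature.

```python
import math
import random

def simulate(mu=0.05, noise_std=0.1, n_iters=400, w0=0.1, seed=0):
    """Run the true recursion and the short-term recursion side by side for
    the hypothetical risk J(w) = 1 - cos(w), whose local minimizer is
    w* = 0 with Hessian J''(0) = 1."""
    rng = random.Random(seed)
    hessian_at_min = 1.0
    w_true, w_short = w0, w0
    max_gap, max_dev = 0.0, abs(w0)
    for _ in range(n_iters):
        s = rng.gauss(0.0, noise_std)          # shared gradient noise
        # True model: the curvature seen by the update depends on w_true.
        w_true = w_true - mu * (math.sin(w_true) + s)
        # Short-term model: curvature frozen at the local minimizer.
        w_short = w_short - mu * (hessian_at_min * w_short + s)
        max_gap = max(max_gap, abs(w_true - w_short))
        max_dev = max(max_dev, abs(w_true))
    return max_gap, max_dev

MAX_GAP, MAX_DEV = simulate()
```

In this toy setting the gap between the two recursions is driven by the cubic remainder of the gradient, so it is bounded by `MAX_DEV**3 / 6` and is much smaller than the deviation from the minimizer itself when the iterates stay close to it.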
Lemma III.6.
(Deviation bounds of decentralized methods). For a fixed small step size and local batch size such that
(49) |
where and , and under assumptions II.1, III.1, and III.2, it can be verified for consensus and diffusion that the second and fourth-order moments of are upper bounded in a finite number of iterations such that , namely,
(50) | ||||
(51) |
where . Also, the second-order moment of is upper bounded by
(52) |
Moreover, the approximation error caused by the short-term model in (48) is upper bounded by
(53) |
Lemma III.7.
(Deviation bounds of the centralized method). Under the same conditions as Lemma III.6, the second-order and fourth-order moments of , the second-order moment of , and the approximation error of the short-term model for the centralized method are guaranteed to be upper bounded in a finite number of iterations . Specifically, it holds that
(54) | ||||
(55) | ||||
(56) |
and
(57) |
Comparing the bounds in Lemmas III.6 and III.7, we find that the upper bounds on the second-order moments associated with the decentralized methods incorporate extra terms, which are generated by the network heterogeneity and graph structure. We will discuss the effects of the terms on the escaping efficiency later. Moreover, Lemmas III.6 and III.7 demonstrate that and dominate for all methods when is sufficiently small. In other words, the approximation error between the mean square deviation of the short-term and true models can be omitted compared with the size of the true models. Furthermore, we can manipulate the bounds in Lemmas III.6 and III.7 to obtain an expression for performance. Using (7) and (10), we can approximate the as follows:
(58) |
which means that can be evaluated by means of the short-term recursion. Thus, the original recursion in (42) can be replaced by (48) in the small step-size regime. More details about (58) can be found in Appendix F.
Building upon the results of Lemmas III.6 and III.7, we next analyze the escaping efficiency of the three algorithms by using (48), and establish the following theorem.
Theorem III.8.
(Escaping efficiency of distributed algorithms). Consider a network of agents running distributed algorithms covered by the short-term model (48). Under assumptions II.1, III.1, III.2, and III.3, and after iterations with
(59) |
it holds that
(60) | ||||
(61) | ||||
(62) |
where
(63) | ||||
(64) | ||||
(65) |
and , , and represent the excess risk of the centralized, consensus and diffusion methods at iteration , respectively.
Proof.
See Appendix F. ∎
Theorem III.8 shows the escaping efficiency of the three algorithms around a local minimum over a finite time horizon. For all three methods, larger or smaller batch size (i.e., larger gradient noise) enable higher escaping efficiency. Also, as mentioned in Appendix F, for the algorithms to successfully exit the local basin they would require iterations. Thus, if the algorithms cannot leave the local minimum in iterations, then we say they cannot effectively escape the local basin and are therefore trapped in it. Furthermore, we observe that the network heterogeneity and graph structure implicitly introduce additional terms into decentralized methods. That is, for homogeneous networks where all agents share the same local minimum , we have
(66) |
which makes . Meanwhile, in the centralized setting, we have and . In these cases, the extra terms are 0. On the other hand, in heterogeneous networks, the nonzero terms lead to larger values for decentralized methods compared to the centralized strategy.
However, the difference between the centralized and decentralized methods is only significant in the large-batch regime. Specifically, if is not sufficiently large such that , then we have under which the additional terms in (61) and (62) will be dominated by the terms associated with the approximation error due to (48) and the initialization condition in Assumption III.1. However, as the batch size increases, the effect of the gradient noise will progressively decrease and the influence of the terms will become more prominent. In the extreme scenario when the full-batch gradient descent is applied, the centralized method becomes completely noise-free, which means that there is no noise to help the algorithm escape from local minima. Nevertheless, the noisy terms related to network heterogeneity and graph structure continue to exist in decentralized approaches.
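The qualitative role of the gradient noise in the excess risk can be reproduced exactly in a simplified setting. The sketch below is a single-agent illustration under assumptions, not an evaluation of (60)–(62): the risk is a hypothetical scalar quadratic, the gradient noise is zero-mean with a model-independent variance that scales inversely with the batch size, and the iterate starts exactly at the minimizer.

```python
def excess_risk_quadratic(mu, batch_size, n_iters, hessian=1.0,
                          noise_var=1.0, w0=0.0):
    """Exact excess risk after n_iters steps of mini-batch gradient descent
    on the hypothetical risk J(w) = 0.5 * hessian * w**2, assuming zero-mean
    gradient noise with variance noise_var / batch_size."""
    second_moment = w0 ** 2
    contraction = (1.0 - mu * hessian) ** 2
    for _ in range(n_iters):
        # E[w_n^2] = (1 - mu*h)^2 * E[w_{n-1}^2] + mu^2 * noise_var / B
        second_moment = contraction * second_moment + mu ** 2 * noise_var / batch_size
    return 0.5 * hessian * second_moment

ER_BASE = excess_risk_quadratic(mu=0.01, batch_size=8, n_iters=100)
ER_LARGER_STEP = excess_risk_quadratic(mu=0.02, batch_size=8, n_iters=100)
ER_SMALLER_BATCH = excess_risk_quadratic(mu=0.01, batch_size=2, n_iters=100)
ER_HUGE_BATCH = excess_risk_quadratic(mu=0.01, batch_size=10**6, n_iters=100)
```

In this single-agent setting the excess risk grows with the step size, grows as the batch shrinks, and essentially vanishes for very large batches. The last case mirrors the noise-free behavior of the centralized method under full-batch gradient descent described above; the extra terms that persist in decentralized methods are not modeled here.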
When comparing the escaping efficiency of two methods from a local minimum, such as the centralized and diffusion methods, there are three possible cases on average: (1) both and exceed the risk barrier , in which case the algorithms will leave the basin around the local minimum; (2) is larger than , while is smaller than , in which case the network driven by diffusion escapes from the local minimum, while the centralized algorithm remains stuck in the current basin; (3) both and are smaller than , in which case the two methods remain trapped in the current basin. These three situations illustrate how higher escaping efficiency corresponds to a higher likelihood of escape from a local minimum. It is also clear that a decentralized method is more effective at escaping from a local minimum than the centralized method in the large-batch regime. However, larger values of , while good for escaping efficiency, nevertheless worsen the optimization performance. We will discuss this point later in the steady state when .
We can further compare the escaping efficiency of diffusion and consensus, and show that the extra appearing in leads to smaller excess risk for diffusion than consensus.
Corollary III.9.
(Smaller for diffusion). Under the same conditions of Theorem III.8, it holds that
(67) |
Proof.
See Appendix G. ∎
Corollary III.9 implies that consensus runs farther away from the local minimum than the diffusion strategy for the same number of iterations due to its worse ER performance. This translates into faster escape for consensus compared to diffusion.
IV Trade-off between flatness and optimization
In this section, we elaborate on the important trade-off between flatness and optimization.
Suppose we run a decentralized or centralized algorithm to solve a possibly nonconvex optimization problem. One useful question is to investigate where the algorithm would prefer to go if it escapes from the current basin. In other words, assuming the algorithm escapes from the basin around some local minimum and evolves until it settles down in the basin of another local minimum, we would like to examine what preferential properties this second minimum has relative to the earlier one. To answer this question, we will relate the escaping efficiency of algorithms measured by and the flatness of local minima measured by .
From expressions (60)–(62), and in the one-dimensional case where is a scalar, it is obvious that larger values of , i.e., sharper minima, enable higher escaping efficiency for all three methods. Thus, it is more likely for the three algorithms to escape from sharp minima than flat ones in the one-dimensional case. In higher dimensions, it is generally intractable to characterize the relationship between and directly. Fortunately, motivated by the Markov inequality, we can appeal to an upper bound for denoted by . Specifically, according to the Markov inequality [7], the probability that the excess risk value remains below a basis threshold satisfies:
(68) |
where, from (60)–(65), the upper bound for the dominant terms of the three methods can be derived:
(69) |
where represents the eigenvalue of . Also,
(70) |
and, similarly,
(71) |
Then, the upper bound variables for the three methods are given by
(72) | ||||
(73) | ||||
(74) |
Observe that these upper bounds on the escaping efficiency are positively correlated with the sharpness of the local minimum through . Substituting (72)–(74) into (68) separately, we find that flat minima, where is small, provide high-probability guarantees that the algorithms stay around the corresponding local basin.
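The role of the Markov inequality in this argument can be checked numerically. For a nonnegative random variable X and a threshold a > 0, the inequality gives P(X < a) >= 1 - E[X]/a, so a smaller bound on the mean yields a stronger guarantee of staying below the threshold. The sketch below is only an illustration: the exponential samples are a hypothetical stand-in for excess-risk values, and the function name and threshold are assumptions.

```python
import random


def markov_lower_bound(mean_value: float, threshold: float) -> float:
    """Markov guarantee: P(X < threshold) >= 1 - E[X] / threshold for X >= 0."""
    return max(0.0, 1.0 - mean_value / threshold)


rng = random.Random(0)
# Hypothetical nonnegative "excess-risk" samples (exponential, mean about 0.5).
samples = [rng.expovariate(2.0) for _ in range(10_000)]
mean_er = sum(samples) / len(samples)

threshold = 2.0
fraction_below = sum(x < threshold for x in samples) / len(samples)
bound = markov_lower_bound(mean_er, threshold)

# A smaller mean (as for a flatter minimum in the text) gives a larger guarantee.
guarantee_small_mean = markov_lower_bound(0.1, threshold)
guarantee_large_mean = markov_lower_bound(0.5, threshold)
```

The inequality also holds exactly for the empirical distribution of the samples, so `fraction_below` can never fall under `bound`, whatever seed is used.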
Combining the discussion of the one-dimensional and higher-dimensional cases, we conclude that the three algorithms prefer to stay around flat minima, which means that they favor flatter basins if they successfully escape from the current minimum. Moreover, as discussed before, higher escaping efficiency makes decentralized algorithms more likely to escape from a local minimum than the centralized method. This fact further indicates that decentralized methods favor flatter minima than their centralized counterpart. However, as already noted, higher escaping efficiency may deteriorate the optimization performance. This motivates us to analyze the excess-risk performance in the long run when , which corresponds to the optimization performance in the steady state.
Strictly speaking, the short-term model defined in (48) may not be valid for nonconvex risk functions in the long run. Fortunately, it has been rigorously verified that (48) can approximate well the true model (42) when in the strongly convex case [6]. Since here we want to examine the convergence behavior of algorithms around local minima given that the algorithms are stuck in the current basin, we can resort to a strong convexity assumption around local minima, for which the following result can be guaranteed.
Corollary IV.1.
(Steady-state excess risks). Consider a network of agents running distributed algorithms covered by (48). After sufficient iterations such that , under Assumptions II.1, III.1, III.2 and III.3, and assuming the algorithms are already trapped in the basin of a local minimum and is locally strongly convex around , it holds that
(75) | ||||
(76) | ||||
(77) |
and, moreover,
(78) |
Proof.
See Appendix H. ∎
In Corollary IV.1, the factors that help algorithms escape local minima are now seen to adversely affect the optimization performance. Again, the difference between decentralized and centralized methods is only significant in the large-batch regime. By integrating the findings of Theorem III.8 and Corollaries III.9 and IV.1, we deduce that network heterogeneity and graph structure inject additional noisy terms into decentralized methods, and that these terms facilitate their escape from sharp minima compared to the centralized method. However, these added noisy terms also cause some deterioration in the long-term optimization performance. Furthermore, although consensus exhibits higher escaping efficiency, this comes at the expense of reduced optimization performance. This observation reveals an intrinsic trade-off between flatness and optimization within the context of multi-agent learning.
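The trade-off can be illustrated on a scalar toy problem. This is only a sketch and not the network model analyzed above: we assume a quadratic risk with curvature h, a constant step size, and additive gradient noise of fixed variance, and all names are hypothetical. Under these assumptions the error variance obeys an exact recursion, so no sampling is needed.

```python
def excess_risk_path(step, curvature, noise_var, num_iters):
    """Expected excess risk of w <- w - step * (curvature * w + noise), from w = 0.

    The error variance v_n satisfies v_{n+1} = (1 - step*curvature)^2 v_n
    + step^2 * noise_var, and the excess risk of the quadratic is 0.5*curvature*v_n.
    """
    variance = 0.0
    path = []
    for _ in range(num_iters):
        variance = (1.0 - step * curvature) ** 2 * variance + step ** 2 * noise_var
        path.append(0.5 * curvature * variance)
    return path


def steady_state_excess_risk(step, curvature, noise_var):
    """Fixed point of the recursion above (requires 0 < step * curvature < 2)."""
    return step * noise_var / (2.0 * (2.0 - step * curvature))


# Assumed values: the larger noise variance stands in for the extra noise terms
# that appear in the decentralized methods.
low_noise = excess_risk_path(step=0.01, curvature=1.0, noise_var=1.0, num_iters=5000)
high_noise = excess_risk_path(step=0.01, curvature=1.0, noise_var=4.0, num_iters=5000)
```

In this toy model the larger noise level produces a larger excess risk after a few iterations (faster movement away from the minimizer) and also a larger steady-state excess risk, which mirrors the trade-off described in the text.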
V Simulation results
In this section, we illustrate the performance of the three algorithms on the CIFAR10 and CIFAR100 datasets across different neural network architectures. We choose ResNet-18 [54], WideResNet-28-10 [55] and DenseNet-121 [56] as the base neural network structures for the two datasets. As for the graph structure of the multi-agent system, we use the Metropolis rule [6] to randomly generate a doubly-stochastic graph with 16 nodes, and its structure is shown in Figure 3(a). In the main text, we present the results for ResNet-18, while the results for the other two neural networks are included in Appendix J. In the decentralized experiments, the full training dataset is divided into subsets, and each agent can only observe one subset. The centralized setting is the same as traditional single-agent learning. Moreover, we simulate three different local batch sizes including . Note that the distributed setting with local batch means the global batch is . As for the learning scheme, we rely on the linear decaying rule [54, 57], where the initial learning rate is divided by when the training process reaches and of the total epochs. That is, moderately small step sizes are applied first to search for flat minima, and then smaller ones are used to guarantee the stability (i.e., convergence) of the algorithms.
Figure 3: (a) Randomly generated graph with 16 nodes; (b) ring graph with 16 nodes.
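A minimal sketch of how Metropolis weights can be built from a graph is given below. It assumes the common form of the rule in which the weight between two neighbors is the reciprocal of the larger of their two degrees (each degree counting the node itself) and the self-weight absorbs the remainder; the function names are hypothetical, and a ring is used instead of a random graph so that the result is deterministic.

```python
def ring_graph(num_nodes):
    """Undirected ring: node k is linked to k-1 and k+1 (modulo num_nodes)."""
    return {k: {(k - 1) % num_nodes, (k + 1) % num_nodes} for k in range(num_nodes)}


def metropolis_weights(neighbors):
    """Combination weights from an undirected graph given as node -> set of neighbors.

    Assumed convention: degree counts the node itself, and
    A[k][l] = 1 / max(degree[k], degree[l]) for neighbors l != k.
    """
    nodes = sorted(neighbors)
    degree = {k: len(neighbors[k]) + 1 for k in nodes}
    weights = {k: {l: 0.0 for l in nodes} for k in nodes}
    for k in nodes:
        for l in neighbors[k]:
            weights[k][l] = 1.0 / max(degree[k], degree[l])
        # The self-weight makes every row sum to one.
        weights[k][k] = 1.0 - sum(weights[k][l] for l in neighbors[k])
    return weights


A = metropolis_weights(ring_graph(16))
row_sums = [sum(A[k].values()) for k in A]
col_sums = [sum(A[k][l] for k in A) for l in A]
```

Because the resulting matrix is symmetric and its rows sum to one, its columns sum to one as well, which is the doubly-stochastic property referred to in the text.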
We first illustrate the flatness and optimization performance of the three algorithms on CIFAR10 and CIFAR100. On the one hand, we visualize the loss landscape around the obtained models in Figure 4. To do so, we use the visualization method from [58] where for any model we compute using some random directions that match the norm of . It can be observed from Figure 4 that the decentralized methods arrive at models that are significantly flatter than the centralized models. However, the difference in flatness between diffusion and consensus is subtle. In other words, diffusion and consensus converge to models with similar flatness. On the other hand, the evolution of the training loss is shown in Figure 5, from which we observe that all three methods converge after sufficient iterations. Also, consensus consistently exhibits a larger training loss in the steady state than diffusion, especially when the local batch is 512, while the optimization performance of diffusion and the centralized method is comparable. Therefore, we conclude that diffusion enables flatter models than the centralized method without obviously sacrificing optimization performance. Note that we also observe some loss spikes in the training loss curves of the centralized method in Figure 5. One explanation for this phenomenon is that the centralized models may be oscillating around a sharp minimum that is unstable when using moderately small step sizes. This suggests intuitively that decentralized methods aid in stabilizing the training process.
Figures 4 and 5: Loss landscapes around the obtained models and training-loss curves on CIFAR10 and CIFAR100, with one panel per dataset for each of the local batch sizes 128, 256 and 512.
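The landscape visualization described above can be sketched as a one-dimensional slice of the loss along a random direction whose norm matches that of the model. The code below is an assumption-laden simplification: it rescales the whole direction vector at once, whereas [58] normalizes filter by filter, and it uses two toy quadratic losses with hypothetical curvatures in place of trained networks.

```python
import math
import random


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def random_direction_like(weights, rng):
    """Gaussian direction rescaled so that its norm equals the norm of `weights`."""
    raw = [rng.gauss(0.0, 1.0) for _ in weights]
    scale = norm(weights) / norm(raw)
    return [scale * x for x in raw]


def loss_slice(loss_fn, weights, direction, alphas):
    """Loss values along the line weights + alpha * direction."""
    return [loss_fn([w + a * d for w, d in zip(weights, direction)]) for a in alphas]


# Toy stand-ins for two trained models: same minimizer, different curvature.
minimizer = [1.0, -2.0, 0.5]


def quadratic_loss(curvature):
    return lambda w: 0.5 * curvature * sum((x - m) ** 2 for x, m in zip(w, minimizer))


rng = random.Random(0)
direction = random_direction_like(minimizer, rng)
alphas = [-0.5, -0.25, 0.0, 0.25, 0.5]
sharp_slice = loss_slice(quadratic_loss(10.0), minimizer, direction, alphas)
flat_slice = loss_slice(quadratic_loss(1.0), minimizer, direction, alphas)
```

Plotting `sharp_slice` and `flat_slice` against `alphas` gives two parabolas with the same minimum, and the flatter model is the one whose curve rises more slowly away from alpha = 0.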
We next compare the test accuracy of the three methods on CIFAR10 and CIFAR100. To better show the performance of the three methods across different settings, we also simulate the ring graph, shown in Figure 3(b), in addition to the random graph structure. Note that the ring graph can be viewed as a specific case of a random graph generated by the Metropolis rule; thus, the random graph is more general. The simulation results are shown in Table I. In each configuration, we run the simulation three times with different seeds, and the final results report the mean of the last 10 epochs over these repetitions. We observe from Table I that diffusion consistently outperforms the other two methods in each setting. In particular, in the cases of , consensus even performs worse than the centralized method on the CIFAR100 dataset, which is attributed to the optimization issue demonstrated in Figure 5.
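Table I: Test accuracy (%) of the three methods for each local batch size.
Dataset | Method | Graph | 128 | 256 | 512 |
CIFAR10 | Centralized | – | | | |
CIFAR10 | Consensus | Random | | | |
CIFAR10 | Consensus | Ring | | | |
CIFAR10 | Diffusion | Random | | | |
CIFAR10 | Diffusion | Ring | | | |
CIFAR100 | Centralized | – | | | |
CIFAR100 | Consensus | Random | | | |
CIFAR100 | Consensus | Ring | | | |
CIFAR100 | Diffusion | Random | | | |
CIFAR100 | Diffusion | Ring | | | |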
Table II: Generalization gap (difference between training and test accuracy) for each local batch size.
Dataset | Method | Graph | 128 | 256 | 512 |
CIFAR10 | Centralized | – | 0.0892 | 0.0960 | 0.1053 |
CIFAR10 | Consensus | Random | 0.0743 | 0.0777 | 0.0686 |
CIFAR10 | Consensus | Ring | 0.0767 | 0.0760 | 0.0583 |
CIFAR10 | Diffusion | Random | 0.0718 | 0.0789 | 0.0805 |
CIFAR10 | Diffusion | Ring | 0.0719 | 0.0764 | 0.0798 |
CIFAR100 | Centralized | – | 0.2976 | 0.3099 | 0.3126 |
CIFAR100 | Consensus | Random | 0.2849 | 0.2817 | 0.2027 |
CIFAR100 | Consensus | Ring | 0.2761 | 0.2764 | 0.1731 |
CIFAR100 | Diffusion | Random | 0.2861 | 0.2989 | 0.2977 |
CIFAR100 | Diffusion | Ring | 0.2863 | 0.3029 | 0.2998 |
The simulation results in Table I can be interpreted through the trade-off between flatness and optimization performance. Note that our simulation results do not contradict the traditional view that flatter models generalize better [31, 32, 33, 34, 35]. Table II reports the generalization gap, measured as the difference between the training and test accuracy, in all experiments; it shows that the flatter models found by decentralized methods have a smaller generalization gap than those found by the centralized approach. However, we emphasize that the final test accuracy is determined by both optimization and generalization performance. When the local batch size is 512 on the CIFAR100 dataset, even though consensus enables better generalization, it simultaneously loses too much optimization performance. Thus, its final test accuracy is worse than that of the centralized method. By contrast, diffusion achieves a favorable balance between flatness and optimization, thereby exhibiting higher test accuracy than the other two methods.
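The claim about the generalization gap can be checked directly against the entries of Table II. The sketch below copies the reported values, keeping the three columns in the order in which they appear in the table; the dictionary layout and the function name are hypothetical.

```python
# Generalization gaps from Table II; the three entries follow the table's column order.
generalization_gap = {
    "CIFAR10": {
        "centralized": [0.0892, 0.0960, 0.1053],
        "consensus/random": [0.0743, 0.0777, 0.0686],
        "consensus/ring": [0.0767, 0.0760, 0.0583],
        "diffusion/random": [0.0718, 0.0789, 0.0805],
        "diffusion/ring": [0.0719, 0.0764, 0.0798],
    },
    "CIFAR100": {
        "centralized": [0.2976, 0.3099, 0.3126],
        "consensus/random": [0.2849, 0.2817, 0.2027],
        "consensus/ring": [0.2761, 0.2764, 0.1731],
        "diffusion/random": [0.2861, 0.2989, 0.2977],
        "diffusion/ring": [0.2863, 0.3029, 0.2998],
    },
}


def decentralized_gaps_smaller(table):
    """True if every decentralized entry is below the centralized entry in its column."""
    return all(
        gap < central
        for methods in table.values()
        for name, gaps in methods.items()
        if name != "centralized"
        for gap, central in zip(gaps, methods["centralized"])
    )


all_smaller = decentralized_gaps_smaller(generalization_gap)
```

A smaller gap alone does not determine the ranking in test accuracy, since the test accuracy also depends on how well the training objective is optimized, which is the point made in the paragraph above.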
VI Conclusion
We analyzed the learning behavior of three popular algorithms around local minima in nonconvex environments. The results show that decentralized methods escape faster from local minima than the centralized strategy. They also show that while consensus outperforms diffusion in terms of escaping ability around local minima, this feature nevertheless comes at the expense of deteriorated optimization performance. As a result, consensus is observed to lead to lower classification accuracy than diffusion. In other words, the results in this paper highlight an important trade-off between escaping efficiency and optimization performance in the context of multi-agent learning.
Although we focused on the traditional mini-batch gradient descent implementation, one useful extension would be to consider other types of optimizers, such as stochastic gradient descent with momentum [59] and ADAM [40]. In addition, it has been observed in the single-agent case that the structure of the gradient noise can influence the escaping efficiency and stability of stochastic gradient algorithms [34, 44]. It would be useful to examine the nature of this effect in the multi-agent case.
Acknowledgments
We acknowledge the assistance of ChatGPT in improving the English expressions in this paper. We also appreciate Dr. Elsa Rizk for her valuable suggestions, which helped us present the results more clearly.
References
- [1] A. H. Sayed, “Adaptive networks,” Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014.
- [2] T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed learning in the nonconvex world: From batch data to streaming and beyond,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020.
- [3] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- [4] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
- [5] A. G. Dimakis, S. Kar, J. M. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proceedings of the IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.
- [6] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends in Machine Learning, vol. 7, pp. 311–801, 2014.
- [7] ——, Inference and Learning from Data. Cambridge University Press, 2022.
- [8] J. Chen and A. H. Sayed, “Distributed pareto optimization via diffusion strategies,” IEEE J. Sel. Top. Signal Process., vol. 7, no. 2, pp. 205–220, 2013.
- [9] T. Zhu, F. He, K. Chen, M. Song, and D. Tao, “Decentralized SGD and average-direction SAM are asymptotically equivalent,” in Proc. ICML, Honolulu, 2023, pp. 43 005–43 036.
- [10] T. Zhu, F. He, L. Zhang, Z. Niu, M. Song, and D. Tao, “Topology-aware generalization of decentralized SGD,” in Proc. ICML, Baltimore, 2022, pp. 27 479–27 503.
- [11] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments - part I: agreement at a linear rate,” IEEE Trans. Signal Process., vol. 69, pp. 1242–1256, 2021.
- [12] M. Kayaalp, S. Vlaski, and A. H. Sayed, “Dif-MAML: Decentralized multi-agent meta-learning,” IEEE Open Journal of Signal Processing, vol. 3, pp. 71–93, 2022.
- [13] Z. Wang, F. R. M. Pavan, and A. H. Sayed, “Decentralized GAN training through diffusion learning,” in Proc. MLSP, Xi’an, 2022, pp. 1–6.
- [14] K. Yuan, S. A. Alghunaim, and X. Huang, “Removing data heterogeneity influence enhances network topology dependence of decentralized SGD,” Journal of Machine Learning Research, vol. 24, no. 280, pp. 1–53, 2023.
- [15] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. NeurIPS, Long Beach, 2017, pp. 5330–5340.
- [16] B. Ying, K. Yuan, H. Hu, Y. Chen, and W. Yin, “Bluefog: Make decentralized algorithms practical for optimization and deep learning,” arXiv preprint arXiv:2111.04287, 2021.
- [17] B. Ying, K. Yuan, Y. Chen, H. Hu, P. Pan, and W. Yin, “Exponential graph is provably efficient for decentralized deep training,” in Proc. NeurIPS, 2021, pp. 13 975–13 987.
- [18] T. Sun, D. Li, and B. Wang, “Decentralized federated averaging,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4289–4301, 2023.
- [19] J. Xu, W. Zhang, and F. Wang, “A(DP)²SGD: Asynchronous decentralized parallel stochastic gradient descent with differential privacy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8036–8047, 2021.
- [20] J. Chen and A. H. Sayed, “On the learning behavior of adaptive networks - part I: Transient analysis,” IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3487–3517, 2015.
- [21] S. Vlaski and A. H. Sayed, “Distributed learning in non-convex environments - part II: polynomial escape from saddle-points,” IEEE Trans. Signal Process., vol. 69, pp. 1257–1270, 2021.
- [22] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. U. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. ICML, vol. 119, 2020, pp. 5381–5393.
- [23] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022.
- [24] J. Chen and A. H. Sayed, “On the learning behavior of adaptive networks - part II: performance analysis,” IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3518–3548, 2015.
- [25] K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed, “On the influence of bias-correction on distributed stochastic optimization,” IEEE Trans. Signal Process., vol. 68, pp. 4352–4367, 2020.
- [26] T. Sun, D. Li, and B. Wang, “Stability and generalization of decentralized stochastic gradient descent,” in Proc. AAAI, 2021, pp. 9756–9764.
- [27] X. Deng, T. Sun, S. Li, and D. Li, “Stability-based generalization analysis of the asynchronous decentralized SGD,” in Proc. AAAI, Washington, 2023, pp. 7340–7348.
- [28] L. Kong, T. Lin, A. Koloskova, M. Jaggi, and S. U. Stich, “Consensus control for decentralized deep learning,” in Proc. ICML, 2021, pp. 5686–5696.
- [29] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in Proc. ICLR, 2021.
- [30] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. ICLR, Toulon, 2017.
- [31] K. Lyu, Z. Li, and S. Arora, “Understanding the generalization benefit of normalization layers: Sharpness reduction,” in Proc. NeurIPS, New Orleans, 2022.
- [32] K. Gatmiry, Z. Li, C.-Y. Chuang, S. Reddi, T. Ma, and S. Jegelka, “The inductive bias of flatness regularization for deep matrix factorization,” in Proc. NeurIPS, New Orleans, 2023.
- [33] L. Wu and W. J. Su, “The implicit regularization of dynamical stability in stochastic gradient descent,” in Proc. ICML, Honolulu, 2023, pp. 37 656–37 684.
- [34] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma, “The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects,” in Proc. ICML, Long Beach, 2019, pp. 7654–7663.
- [35] M. S. Nacson, K. Ravichandran, N. Srebro, and D. Soudry, “Implicit bias of the step size in linear diagonal neural networks,” in Proc. ICML, Baltimore, 2022, pp. 16 270–16 295.
- [36] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio, “Fantastic generalization measures and where to find them,” in Proc. ICLR, Addis Ababa, 2020.
- [37] S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate bayesian inference,” J. Mach. Learn. Res., vol. 18, pp. 134:1–134:35, 2017.
- [38] T. Mori, Z. Liu, K. Liu, and M. Ueda, “Power-law escape rate of SGD,” in Proc. ICML, Baltimore, 2022, pp. 15 959–15 975.
- [39] Z. Xie, I. Sato, and M. Sugiyama, “A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima,” in Proc. ICLR, 2021.
- [40] P. Zhou, J. Feng, C. Ma, C. Xiong, S. C. Hoi, and W. E, “Towards theoretically understanding why SGD generalizes better than adam in deep learning,” in Proc. NeurIPS, 2020.
- [41] S. Pesme, L. Pillaud-Vivien, and N. Flammarion, “Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity,” in Proc. NeurIPS, 2021, pp. 29 218–29 230.
- [42] U. Simsekli, L. Sagun, and M. Gürbüzbalaban, “A tail-index analysis of stochastic gradient noise in deep neural networks,” in Proc. ICML, Long Beach, 2019, pp. 5827–5837.
- [43] L. Wu, C. Ma, and W. E, “How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective,” in Proc. NeurIPS, Montréal, 2018, pp. 8289–8298.
- [44] L. Wu, M. Wang, and W. Su, “The alignment property of SGD noise and how it helps select flat minima: A stability analysis,” in Proc. NeurIPS, New Orleans, 2022.
- [45] S. Tu and A. H. Sayed, “Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks,” IEEE Trans. Signal Process., vol. 60, no. 12, pp. 6217–6234, 2012.
- [46] A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein, “Metastability in reversible diffusion processes I: Sharp asymptotics for capacities and exit times,” J. Eur. Math. Soc. (JEMS), vol. 6, no. 4, pp. 399–424, 2004.
- [47] H. Ibayashi and M. Imaizumi, “Why does SGD prefer flat minima?: Through the lens of dynamical systems,” in Proc. AAAI workshop: When Machine Learning meets Dynamical Systems: Theory and Applications, Washington, 2023.
- [48] T. H. Nguyen, U. Simsekli, M. Gürbüzbalaban, and G. Richard, “First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise,” in Proc. NeurIPS, Vancouver, 2019, pp. 273–283.
- [49] K. Ahn, A. Jadbabaie, and S. Sra, “How to escape sharp minima with random perturbations,” 2024. [Online]. Available: https://arxiv.org/pdf/2305.15659.pdf
- [50] K. Wen, Z. Li, and T. Ma, “Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization,” in Proc. NeurIPS, New Orleans, 2023.
- [51] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, “Hessian-based analysis of large batch training and robustness to adversaries,” in Proc. NeurIPS, Montréal, 2018, pp. 4954–4964.
- [52] J. A. Tropp, “An introduction to matrix concentration inequalities,” 2015. [Online]. Available: https://arxiv.org/abs/1501.01571
- [53] E. Barshan, M.-E. Brunet, and G. K. Dziugaite, “Relatif: Identifying explanatory training samples via relative influence,” in Proc. AISTATS, 2020, pp. 1899–1909.
- [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, Las Vegas, 2016, pp. 770–778.
- [55] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proc. BMVC, York, 2016.
- [56] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. CVPR, Honolulu, 2017, pp. 2261–2269.
- [57] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. ICLR, Addis Ababa, 2020.
- [58] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Proc. NeurIPS, Montréal, 2018, pp. 6391–6401.
- [59] K. Yuan, Y. Chen, X. Huang, Y. Zhang, P. Pan, Y. Xu, and W. Yin, “Decentlam: Decentralized momentum SGD for large-batch deep training,” in Proc. ICCV, Montreal, 2021, pp. 3009–3019.
Appendix A Proof for Lemma III.4: Properties associated with gradient noise
Proofs in this section follow from [6, 12, 7]. We first note that the mini-batch gradient is an unbiased estimator of the true gradient. That is, for any :
(79) |
Using (79) and the independence of data among agents, for any two agents and , and , we have
(80) |
Then, for we have
(81) |
where follows from Jensen’s inequality, follows from the following inequality:
(82) |
(83) |
and follows from the Lipschitz condition in Assumption III.2. Similarly, for the fourth-order moment of the stochastic gradient noise, we have:
(84) |
We now verify that the size of terms related to gradient noise decreases with the increase of batch size. For any , we verify how the second and fourth-order conditional gradient noise varies with batch size. Recall (A) and (A), for :
(85) |
where , while for general :
(86) |
where follows from the independence among data. As for the fourth-order conditional gradient noise, we prove it by induction. Assume
(87) |
then we have
(88) |
Combining (A), (A), (A) and (A), we can verify that the variance and fourth-order moment of the gradient noise are upper bounded by terms related to the batch size and the difference between the current model and the local minimizer denoted by :
(89) | ||||
(90) |
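The two facts used in this appendix, unbiasedness of the mini-batch gradient and the decay of its variance with the batch size, can be verified exactly on a small example. The sketch below assumes uniform sampling with replacement from a finite set of scalar per-sample gradients, in which case the variance of the mini-batch average is the per-sample variance divided by the batch size; the values and names are hypothetical.

```python
from itertools import product


def minibatch_moments(sample_gradients, batch_size):
    """Exact mean and variance of the mini-batch average.

    All equally likely batches drawn uniformly with replacement are enumerated,
    so the returned moments are exact rather than Monte Carlo estimates.
    """
    batch_means = [
        sum(batch) / batch_size
        for batch in product(sample_gradients, repeat=batch_size)
    ]
    mean = sum(batch_means) / len(batch_means)
    variance = sum((m - mean) ** 2 for m in batch_means) / len(batch_means)
    return mean, variance


# Hypothetical per-sample gradients at a fixed model (scalar case).
sample_gradients = [-1.0, 0.5, 2.0, 2.5]
true_gradient = sum(sample_gradients) / len(sample_gradients)
per_sample_variance = sum((g - true_gradient) ** 2 for g in sample_gradients) / len(sample_gradients)

moments = {b: minibatch_moments(sample_gradients, b) for b in (1, 2, 3)}
```

For every batch size the mean equals the true gradient, while the variance shrinks in proportion to the inverse of the batch size.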
As for the gradient covariance matrices, we have
(91) |
where is due to the fact that all data are sampled independently at each agent.
Appendix B Proof for Lemma III.5
We establish the relationship between the gradient covariance and Hessian matrices by extending the result from the single-agent case [34] to the multi-agent setting. For any , the negative log-likelihood loss is expressed as
(92) |
where is the true label of . Then, in the stochastic risk minimization where the risk function is defined as:
(93) |
we have the following gradient covariance matrix:
(94) |
where follows from (33), and follows from (79), the independence among agents and the following equality:
(95) |
As for the Hessian matrix, we have
(96) |
where follows from the following equality:
(97) |
From (B) and (B), we have the exact equivalence relationship between and for the stochastic risk minimization:
(98) |
Strictly speaking, since stochastic risk minimization samples the targets from all possible predictions while empirical optimization uses the specific training label for each sample, the equivalence need not hold exactly in empirical settings. Fortunately, we can still approximate by with a Monte Carlo estimate based on the training set [53, 34], i.e., when is sufficiently large at all agents. Thus, in this paper, we use
(99) |
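The equivalence between the gradient covariance and the Hessian for the negative log-likelihood loss can be illustrated in the simplest setting. The sketch below assumes a one-parameter logistic model with a single feature and a label drawn from the model's own predictive distribution; these choices and all names are assumptions made for illustration, not the setting of the lemma.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(w, x, y):
    """Negative log-likelihood of label y in {0, 1} under p = sigmoid(w * x)."""
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad_nll(w, x, y):
    return (sigmoid(w * x) - y) * x


def gradient_second_moment(w, x):
    """E_y[grad^2] with y ~ Bernoulli(sigmoid(w * x)); the mean gradient is zero."""
    p = sigmoid(w * x)
    return p * grad_nll(w, x, 1) ** 2 + (1.0 - p) * grad_nll(w, x, 0) ** 2


def hessian_nll(w, x):
    """Second derivative of the loss in w; it does not depend on the label."""
    p = sigmoid(w * x)
    return p * (1.0 - p) * x * x


w, x = 0.7, 1.3
p = sigmoid(w * x)
mean_gradient = p * grad_nll(w, x, 1) + (1.0 - p) * grad_nll(w, x, 0)
covariance = gradient_second_moment(w, x)
hessian = hessian_nll(w, x)
```

Since the mean gradient vanishes when labels are drawn from the model, the second moment is the gradient covariance, and it coincides with the Hessian in this example.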
Appendix C Proof for Lemmas III.6 and III.7: Upper bound for the second-order error moment
The proof in this section is similar to [6, 20, 24] with two important differences: we now focus on nonconvex (as opposed to convex) objective functions and on the finite-horizon (as opposed to infinite-horizon) case. To examine the size of , we first analyze the mean-square error associated with in (42), denoted by , for later use; it can be verified that this quantity is upper bounded by a term associated with and the gradient noise.
Since
(104) |
we have
(105) |
Also, by the mean-value theorem [6], we have
(106) |
Note that for the heterogeneous networks, we have
(107) |
Recall the error recursion in (42), since and are either or , we have
(108) |
from which we have
(109) |
where follows from (35). Multiplying (C) from the left by , the following recursion can be obtained:
(114) | ||||
(121) | ||||
(126) |
Note that according to (107), we have
(127) |
Also, consider the average of the Hessian matrices of all agents denoted by
(128) |
According to (108), the recursion (114) can be split as
(129) | ||||
(130) |
Note that in the centralized method, we have
(131) |
so that we only need to analyze the term for the centralized method. Here we start from decentralized methods.
Conditioning both sides of (129) on , we now analyze the second-order moments of the two terms separately. First, for , consider
(132) |
According to (129), we have
(133) |
where (a) follows from (79).
We bound the first term on the right-hand side of (133). Note that
(134) |
and we know from [6] where denotes the spectral radius of . Let , we have
(135) |
where and follow from Jensen’s inequality.
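A form of Jensen's inequality that is commonly used in recursions of this kind is the bound ||a + b||^2 <= ||a||^2 / t + ||b||^2 / (1 - t) for any 0 < t < 1, which follows from writing a + b = t (a / t) + (1 - t) (b / (1 - t)) and using the convexity of the squared norm. Whether the steps above use exactly this form is an assumption; the sketch below, with hypothetical names and random test vectors, only checks the inequality numerically.

```python
import random


def sq_norm(vector):
    return sum(x * x for x in vector)


def jensen_bound(a, b, t):
    """Right-hand side of ||a + b||^2 <= ||a||^2 / t + ||b||^2 / (1 - t)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    return sq_norm(a) / t + sq_norm(b) / (1.0 - t)


rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(5)]
b = [rng.gauss(0.0, 1.0) for _ in range(5)]

lhs = sq_norm([x + y for x, y in zip(a, b)])
bounds = {t: jensen_bound(a, b, t) for t in (0.1, 0.5, 0.9)}
```

Choosing t equal to a contraction factor smaller than one is what typically lets the first term keep a single power of that factor after squaring.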
We now bound the second term of (133), which is related to the gradient noise. Consider
(136) |
we have
(137) |
Thus, we should bound . Fortunately, we have
(138) |
where follows from (89). Combining (133), (C), (C) and (137), we obtain
(139) |
We now analyze the size of . Consider
(140) |
for which we have
(141) |
where follows from the sampling independence among agents.
Also, consider , with (129) we have
(142) |
where follows from (79), follows from Jensen’s inequality and (141), and follows from (C) and the Lipschitz condition in Assumption III.2.
Combining (139) and (C), we obtain
(151) |
Let
(152) |
and by iterating (151), we obtain
(157) |
To proceed, it is necessary to compute . Basically, since
(158) |
we have
(163) | ||||
(168) |
As for the matrix power ,
(171) |
for which by resorting to Lemma 2 in [11], with , we have
(172) |
then we have
(177) |
Then, substituting (163), (177) and (178) into (157), we obtain
(189) |
from which we have
(190) |
According to (49), we obtain
(191) |
where .
We finally examine the bounds of the centralized method. As mentioned via (131), in the centralized method, is always . Thus, we only need to analyze , for which according to (C), we have:
(192) |
Following the same steps as in (171)–(191), for the centralized method, we have
(193) |
By comparing (191) and (193), we see that the decentralized methods have extra noisy terms related to the network heterogeneity and graph structure.
Appendix D Proof for Lemmas III.6 and III.7: Upper bound for the fourth-order moment
In this section, we analyze the size of for later use. We recall the relation of to in (129), and use the following equality:
(194) |
where and are two column vectors. When , with the Cauchy–Schwarz inequality, we have:
(195) |
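Fourth-order bounds of this type usually start from the exact expansion ||a + b||^4 = ||a||^4 + ||b||^4 + 2 ||a||^2 ||b||^2 + 4 (a^T b)^2 + 4 ||a||^2 (a^T b) + 4 ||b||^2 (a^T b), followed by the Cauchy–Schwarz inequality (a^T b)^2 <= ||a||^2 ||b||^2. The sketch below checks both facts numerically on random vectors; that (194) and (195) take exactly this form is an assumption, and the function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def fourth_power_expansion(a, b):
    """Term-by-term expansion of ||a + b||^4 = (||a||^2 + 2 a.b + ||b||^2)^2."""
    na, nb, ab = dot(a, a), dot(b, b), dot(a, b)
    return na ** 2 + nb ** 2 + 2 * na * nb + 4 * ab ** 2 + 4 * na * ab + 4 * nb * ab


rng = random.Random(2)
a = [rng.gauss(0.0, 1.0) for _ in range(4)]
b = [rng.gauss(0.0, 1.0) for _ in range(4)]

total = [x + y for x, y in zip(a, b)]
direct = dot(total, total) ** 2
expanded = fourth_power_expansion(a, b)
cauchy_schwarz_holds = dot(a, b) ** 2 <= dot(a, a) * dot(b, b) + 1e-12
```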
We first analyze the size of . Let , we have
(196) |
where follows from and (195), and follows from Jensen’s inequality.
Then for , similar to the proof when analyzing the second-order error in Appendix C, we consider which is the spectral radius of and obtain
(197) |
where follows from and (195), and follows from Jensen’s inequality.
To proceed with the analysis for and , we should analyze the fourth-order moment associated with gradient noise. Recall (136) and (140), we have
(198) |
Also,
(199) |
where follows from Jensen’s inequality, and follows from . Moreover, we have
(200) |
Substituting into , we have
(201) |
where follows from Jensen’s inequality such that
(202) |
Substituting , , and into (D), we have
(203) |
where in we apply the result of (189) and the following inequality:
(204) |
Similarly, for , we substitute , and into (D), and obtain
(205) |
Combining (D) and (D), we obtain
(214) |
Let
(217) |
and by iterating (214), we have
(224) |
for which with Assumption III.1 we have
(225) |
where follows from Jensen’s inequality. Also,
(226) |
Note that
(233) |
and
(238) |
so that
(241) |
Appendix E Proof for Lemmas III.6 and III.7: Approximation error of the short-term model
The argument is similar to [6] except that we now focus on nonconvex (as opposed to convex) risk functions. To clarify how far the algorithm can escape from a local minimum , it is necessary to assess the size of the distance between and rather than upper bound it. However, the dependence of on makes the analysis difficult. This motivates the short-term model in (48). Consider
(266) |
and recall the short-term model in (48):
(267) |
where the Hessian matrix is approximated by the Hessian matrix at , and .
In this section, we analyze the approximation error caused by the short-term model. Consider which measures the difference between the true model and the short-term model. Subtracting (42) and (267), we obtain:
(268) |
Note that according to the Taylor expansion technique, when is sufficiently close to , we have:
(269) |
from which we get
(270) |
so that
(271) |
We now analyze the size of . Substituting (35) into (268), we have
(272) |
Similar to Appendix C, we multiply both sides of (E) by so that it can be decomposed as
(277) | ||||
(284) |
Consider
(285) |
which is positive-definite since it is the Hessian matrix of the global risk at a local minimum. Then we have
(286) |
Still, for the centralized method, we only need to analyze as now is always .
Next, we analyze . Let which is the spectral radius of , we have
(288) |
where follows from Jensen’s inequality. By combining (E) and (E), we obtain the recursion associated with the size of :
(295) |
where , and
(298) |
By iterating (295), we have
(301) |
for which it can be verified that
(306) |
and
(311) |
Again, similar to (172), we have:
(312) |
Then we substitute (D), (306), (311) and (312) into (301), and obtain
(321) |
from which we have
(322) |
Since
(323) |
where in we use the results of (191) and (322), then we have
(324) |
where in we use the results of and . Similar to (E), we have
(325) |
from which we have
(326) |
Combining (E) and (326), we have
(327) |
which means that the approximation error of replacing by can be omitted compared with the size of . In other words, carries sufficient information of .
Appendix F Proof for Theorem III.8
In this section, we derive closed-form expressions for the excess-risk performance of centralized, consensus and diffusion strategies over a finite time horizon . Specifically, we verify how far the algorithms can escape from the local minimum in terms of the risk function. To do this, we need to use the upper bounds in Lemmas III.6 and III.7, where centralized and decentralized methods have different expressions. For simplicity, we use to unify the results of decentralized and centralized methods in Lemmas III.6 and III.7, namely,
(331) |
where in the centralized method , while in decentralized methods .
For each agent, the excess risk corresponding to its model is:
(332) |
where follows from the Taylor expansion, and follows from (107). Then we have the following average excess risk (or escaping efficiency) across the network involving models of all agents:
(333) |
where follows from Taylor expansion and the following inequality:
(334) |
and follows from the following inequality which is similar to (E) and (326):
(335) |
(336) |
so that
(337) |
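The Taylor-expansion steps above rest on the fact that, near a local minimizer, the excess risk is well described by the quadratic form built from the Hessian at that minimizer. The sketch below illustrates this on a toy risk function. The function `risk`, the matrix `HESSIAN` and the helper names are hypothetical choices made for illustration only; they are not the paper's risk or notation.

```python
def risk(w):
    """Toy nonconvex risk with a local minimum at the origin (hypothetical)."""
    w1, w2 = w
    return w1**2 + 0.5 * w2**2 + w1**2 * w2


# Hessian of the toy risk evaluated at the minimizer w* = (0, 0).
HESSIAN = [[2.0, 0.0], [0.0, 1.0]]


def quadratic_excess_risk(w, hessian=HESSIAN):
    """Second-order approximation 0.5 * w^T H w of risk(w) - risk(w*)."""
    hw = [sum(hessian[i][j] * w[j] for j in range(2)) for i in range(2)]
    return 0.5 * sum(w[i] * hw[i] for i in range(2))


def relative_error(t):
    """Relative gap between exact and quadratic excess risk at w = (t, t), t > 0."""
    w = (t, t)
    exact = risk(w) - risk((0.0, 0.0))
    return abs(exact - quadratic_excess_risk(w)) / exact


# The gap shrinks as the iterate gets closer to the minimizer.
errors = [relative_error(t) for t in (1.0, 0.1, 0.01)]
```

For this toy function the neglected term is cubic in the distance to the minimizer, so the relative error decays linearly with that distance.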
To proceed, we examine the size of . We start from the short-term model in (267) and iterate it,
(338) |
from which we proceed to examine the size of as follows:
(339) |
We first examine the term . To do so, it is necessary to compute :
(340) |
Recall that
(341) |
and consider
(342) |
then, it holds that
(343) | ||||
(344) |
Substituting (36) and (132) into (342), we obtain
(352) |
from which we have
(355) |
We appeal to the block matrix inversion formula:
(362) |
where the Schur complement is defined by
(363) |
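As a sanity check on the block matrix inversion formula, the sketch below implements the standard form based on the Schur complement S = D - C A^{-1} B of the upper-left block and compares it with a direct inverse in exact rational arithmetic. This is an assumption about the convention: the paper's (362) and (363) may instead use the Schur complement of the lower-right block. All function names and the test matrices are hypothetical.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y, sign=1):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def inverse(X):
    """Exact Gauss-Jordan inverse; assumes X is square and invertible."""
    n = len(X)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(X)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def block_inverse(A, B, C, D):
    """Inverse of [[A, B], [C, D]] via the Schur complement S = D - C A^{-1} B."""
    A_inv = inverse(A)
    C_A_inv = matmul(C, A_inv)
    A_inv_B = matmul(A_inv, B)
    S_inv = inverse(add(D, matmul(C_A_inv, B), sign=-1))
    top_left = add(A_inv, matmul(matmul(A_inv_B, S_inv), C_A_inv))
    top_right = [[-v for v in row] for row in matmul(A_inv_B, S_inv)]
    bottom_left = [[-v for v in row] for row in matmul(S_inv, C_A_inv)]
    top = [l + r for l, r in zip(top_left, top_right)]
    bottom = [l + r for l, r in zip(bottom_left, S_inv)]
    return top + bottom


# Hypothetical 2x2 blocks; both A and the Schur complement are invertible.
A = [[2, 1], [1, 3]]
B = [[1, 0], [0, 1]]
C = [[0, 1], [1, 0]]
D = [[4, 1], [2, 5]]
M = [ra + rb for ra, rb in zip(A, B)] + [rc + rd for rc, rd in zip(C, D)]
assert block_inverse(A, B, C, D) == inverse(M)
```

The formula is useful in derivations like this one because the lower-right block of the inverse is exactly S^{-1}, so the size of that block can be read off from the Schur complement alone.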
Applying this formula to (355), we have
(364) |
and
(365) | ||||
(366) | ||||
(367) | ||||
(368) | ||||
(369) |
so that
(372) |
As for , we have
(377) |
We therefore obtain
(380) |
and also
(381) |
Note that for any two invertible matrices and , where and , i.e., dominates , we have
(382) |
which means that the inverse of the sum of two matrices is well approximated by the inverse of the dominant matrix. Also, for any two vectors and , assume and , i.e., dominates ; then we have
(383) |
which means that the squared norm of the sum of two vectors is well approximated by the squared norm of the dominant vector.
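Both dominance approximations can be checked numerically. The sketch below perturbs a fixed matrix and a fixed vector by a term scaled by a small parameter `eps` and measures the relative error of dropping that term. The matrices, vectors, `eps` values and helper names are hypothetical; the paper's precise order conditions on the two terms are not reproduced here.

```python
def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def frob(M):
    """Frobenius norm of a matrix."""
    return sum(v * v for row in M for v in row) ** 0.5


def inverse_gap(eps):
    """Relative error of approximating (A + B)^{-1} by A^{-1} when B scales with eps."""
    A = [[2.0, 0.5], [0.5, 1.0]]                          # dominant matrix
    B = [[eps * 1.0, eps * -0.3], [eps * 0.2, eps * 0.7]]  # small perturbation
    total = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
    exact, approx = inv2(total), inv2(A)
    diff = [[exact[i][j] - approx[i][j] for j in range(2)] for i in range(2)]
    return frob(diff) / frob(exact)


def squared_norm_gap(eps):
    """Relative error of approximating ||x + y||^2 by ||x||^2 when y scales with eps."""
    x = [1.0, -2.0]              # dominant vector
    y = [eps * 0.5, eps * 1.5]   # small perturbation
    exact = sum((xi + yi) ** 2 for xi, yi in zip(x, y))
    approx = sum(xi ** 2 for xi in x)
    return abs(exact - approx) / exact


# Both relative errors shrink roughly in proportion to eps.
matrix_gaps = [inverse_gap(e) for e in (1e-1, 1e-2, 1e-3)]
vector_gaps = [squared_norm_gap(e) for e in (1e-1, 1e-2, 1e-3)]
```

In both cases the leading neglected term is first order in the perturbation, which is why the approximation error is of the same relative order as the ratio between the dominated and the dominant term.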
We recall the term , and obtain:
(384) |
Note that we have:
(385) |
where follows from Assumption III.1 and (381), and follows from (49), then we have:
(386) |
We now examine . To do so, we substitute (340), (343), (344), (372), (380), (382) and (383) into , and obtain
(387) |
where follows from (343) and (344), follows from the following equality:
(388) |
and follows from the following equality:
(392) | |||
(395) | |||
(399) | |||
(402) | |||
(403) |
where follows from (107) and (108), and follows from (382) and (383), and in we apply the following inequality:
(404) |
where follows from Assumption III.3, and follows from the following equality:
(405) |
Substituting (F) into (386), we obtain