Revisiting Decentralized ProxSkip: Achieving Linear Speedup
Corresponding author: **de Cao.
Abstract
The ProxSkip algorithm for decentralized and federated learning is gaining increasing attention due to its proven benefits in reducing communication complexity while maintaining robustness against data heterogeneity. However, existing analyses of ProxSkip are limited to the strongly convex setting and do not achieve linear speedup, where convergence performance improves linearly with the number of nodes. So far, questions remain open about how ProxSkip behaves in the non-convex setting and whether linear speedup is achievable.
In this paper, we revisit decentralized ProxSkip and address both questions. We demonstrate that the leading communication complexity of ProxSkip is $\mathcal{O}\left(\frac{p\sigma^2}{n\epsilon^2}\right)$ for non-convex and convex settings, and $\mathcal{O}\left(\frac{p\sigma^2}{n\epsilon}\right)$ for the strongly convex setting, where $n$ represents the number of nodes, $p$ denotes the probability of communication, $\sigma^2$ signifies the level of stochastic noise, and $\epsilon$ denotes the desired accuracy level. This result illustrates that ProxSkip achieves linear speedup and can asymptotically reduce communication overhead proportional to the probability of communication. Additionally, for the strongly convex setting, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes.
1 Introduction
In this work, we consider the following decentralized optimization problem by a group of agents connected over a network:
$$\min_{x\in\mathbb{R}^d}\; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) := \mathbb{E}_{\xi_i\sim\mathcal{D}_i}\big[F_i(x;\xi_i)\big], \tag{1}$$
where $\{\mathcal{D}_i\}_{i=1}^{n}$ represent data distributions, which can be heterogeneous across nodes, and $f_i:\mathbb{R}^d\to\mathbb{R}$ is a smooth local function accessed by node $i$. In this setup, a network of $n$ nodes (also referred to as agents, workers, or clients) collaboratively seeks to minimize the average of the nodes’ objectives. Solving problem (1) in a decentralized manner has garnered considerable attention in recent years [2, 3, 4, 5, 6]. These methods do not rely on a central coordinator and communicate only with neighbors in an arbitrary communication topology. Nevertheless, decentralized optimization algorithms may still face challenges arising from communication bottlenecks.
To reduce communication costs, many techniques have been proposed. These techniques include compressing models and gradients [7, 8, 9, 10, 11, 12, 13], using accelerated schemes [14, 15, 16, 17, 18, 19], and implementing local updates [20, 21, 22, 23, 24]. By applying these strategies, it is possible to reduce the amount of information exchanged between different nodes during training, thereby improving the efficiency of distributed training setups.
In this work, we mainly focus on performing local updates as a means to reduce communication frequency. In centralized settings (federated learning), local-SGD/FedAvg [22, 23, 25, 26] has emerged as one of the most widely adopted learning methods that employ local updates. However, when dealing with heterogeneous data, local-SGD/FedAvg encounters the challenge of “client-drift.” This phenomenon arises from the diversity of functions on each node, causing each client to converge towards the minimizer of its own function $f_i$, which may be significantly distant from the global optimum $x^\star$. To tackle this issue, several algorithms have been proposed, including Scaffold [27], Scaffold with momentum [28], FedLin [29], FedPD [30], FedDyn [31], VRL-SGD [32], FedGATE [33], and SCALLION/SCAFCOM [34]. In decentralized settings, local-DSGD has been introduced in [35]. Similarly to local-SGD, it also encounters the issue of client-drift when dealing with heterogeneous data. To mitigate the drift in local-DSGD, several algorithms have been proposed. Notably, gradient-tracking (GT) based approaches, such as local-GT [39] and $K$-GT [40], have been developed. Additionally, algorithms based on Exact-Diffusion/NIDS/D2 [41, 42, 43, 44], such as LED [45], have been introduced. Distinct from these periodic local-update methods [27, 29, 30, 35, 39, 40, 45], methods incorporating probabilistic local updates have been proposed, such as ProxSkip [46] and its extensions TAMUNA [47], CompressedScaffnew [48], VR-ProxSkip [49], ODEProx [50], and RandProx [51].
It is known that ProxSkip does not depend on the heterogeneity of the data and exhibits linear convergence on distributed strongly convex problems in the absence of stochastic noise [46]. When the network is sufficiently well-connected, ProxSkip [46] and its extensions [47, 48, 49, 50, 51] are gaining increasing attention due to their proven benefits in reducing communication complexity. When deploying ProxSkip in machine learning, it becomes imperative to understand its behavior on non-convex tasks and its sensitivity to stochastic noise. However, existing ProxSkip convergence analyses focus on convex settings, and their main limitation is the inability to prove linear speedup in terms of the number of nodes. Notice that although [50] presents the ODEProx algorithm and gives a more rigorous analysis of ProxSkip in the strongly convex setting, this new analysis shares the same limitation as the original ProxSkip analysis, namely, the inability to achieve linear speedup. Achieving linear speedup is highly desirable for a decentralized/federated learning algorithm, as it enables effective utilization of the massive parallelism inherent in large decentralized/federated learning systems. Consequently, two fundamental open questions emerge:
(1) How does ProxSkip behave on non-convex tasks?
(2) Can we establish a linear speedup bound for ProxSkip in the presence of stochastic noise?
In this paper, we revisit ProxSkip for decentralized learning and provide answers to both questions. Specifically, we develop a new analysis with a novel proof technique under non-convex, convex, and strongly convex settings. Through this analysis, we obtain several new results that are comparable to the bounds of state-of-the-art decentralized algorithms while achieving linear speedup bounds.
We highlight our contributions as follows:
• We establish the non-asymptotic convergence rate under stochastic non-convex, convex, and strongly convex settings of ProxSkip for problem (1). In particular, we prove that ProxSkip at iteration converges with rate
N-CVX/CVX: (2)
S-CVX: (3)
where is the stepsize of ProxSkip, $\sigma^2$ denotes the variance of the stochastic gradient, is a topology-dependent quantity that approaches for a large and sparse network, $\mu$ is the strong convexity constant, and is a constant that depends on the initialization. To the best of our knowledge, this is the first work that establishes the convergence rate of probabilistic decentralized methods for non-convex settings. We offer a comparison of convergence rates of ProxSkip for problem (1) in Table 1.
• We prove that, after enough transient time, the expected communication complexity of ProxSkip is $\mathcal{O}\left(\frac{p\sigma^2}{n\epsilon^2}\right)$ (or $\mathcal{O}\left(\frac{p\sigma^2}{n\epsilon}\right)$ for S-CVX), where $\epsilon$ denotes the desired accuracy level, demonstrating that ProxSkip achieves linear speedup with respect to the number of nodes $n$. In addition, for the strongly convex setting, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes. The proposed new proof technique overcomes the analytical limitations of [46, 47, 48, 49, 50, 51]. To the best of our knowledge, we prove for the first time that ProxSkip can achieve linear speedup.
• We elucidate the effects of noise, local steps, and data heterogeneity on the convergence of ProxSkip in stochastic non-convex, convex, and strongly convex settings. We demonstrate the robustness of ProxSkip against data heterogeneity while enhancing communication efficiency by local updates. Furthermore, we show that the convergence rates exhibited by ProxSkip in stochastic settings are comparable with those of existing state-of-the-art decentralized algorithms incorporating local updates [35, 40, 45] (see Table 2).
2 Setup
All vectors are column vectors unless otherwise stated. We let $x_i^t \in \mathbb{R}^d$ represent the local state of node $i$ at the $t$-th iteration. For notational convenience, we use bold capital letters to denote stacked variables. For instance, $\mathbf{X}^t := [x_1^t, x_2^t, \ldots, x_n^t]^{\top} \in \mathbb{R}^{n\times d}$.
2.1 Network graph
In this work, we focus on decentralized scenarios (undirected and connected network), where a network of nodes is interconnected by a graph with a set of edges , where node is connected to node if . To describe the algorithm, we introduce the global mixing matrix , where if , and otherwise. We impose the following standard assumption on the mixing matrix.
Assumption 1.
The mixing matrix is symmetric, doubly stochastic, and primitive. Let denote the largest eigenvalue of the mixing matrix , and the remaining eigenvalues are denoted as .
We introduce two quantities as follows: , where . Under Assumption 1, the matrix is positive semi-definite and doubly stochastic. Furthermore, we have , and is well-conditioned when is large.
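As a small numerical illustration of Assumption 1 (a sketch under assumptions, not taken from the paper), the snippet below builds the mixing matrix of a ring graph in which every node averages itself and its two neighbors with weight 1/3. The ring topology, the uniform weights, and the function names are hypothetical choices. The snippet prints the largest, second-largest, and smallest eigenvalues: the largest equals 1, the others lie strictly inside (-1, 1), and the second-largest moves toward 1 as the ring grows, which is the sense in which a large, sparse network is poorly connected.

```python
import numpy as np

def ring_mixing_matrix(n: int) -> np.ndarray:
    """Mixing matrix of a ring with n >= 3 nodes and uniform weights 1/3 (illustrative)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def sorted_eigenvalues(W: np.ndarray) -> np.ndarray:
    """Eigenvalues of a symmetric matrix, sorted from largest to smallest."""
    return np.sort(np.linalg.eigvalsh(W))[::-1]

for n in (5, 20, 80):
    eig = sorted_eigenvalues(ring_mixing_matrix(n))
    # largest, second-largest, and smallest eigenvalue
    print(n, eig[0], eig[1], eig[-1])
```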
2.2 Algorithm description
The ProxSkip algorithm [46] for problem (1) can be written as
(4a) | ||||
(4b) | ||||
(4c) |
Here, is the stepsize (learning rate), , with representing the stochastic gradient of , with probability and with probability , and is the control variate. At each iteration , communication takes place with a probability . In the absence of communication, the update is performed, while remains unchanged. This allows for multiple iterations of local computations to be performed between communication rounds. Decomposing the updates for individual nodes, we provide a detailed implementation in Algorithm 1.
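For orientation, the centralized special case of ProxSkip from [46] (there called Scaffnew), in which a communication step is an exact average over all nodes, can be sketched as below. This is a minimal illustration under stated assumptions and is not the decentralized update (4): the quadratic local objectives, the names `gamma`, `h`, and `scaffnew`, and the parameter values are hypothetical, and exact gradients are used (no stochastic noise).

```python
import numpy as np

def scaffnew(a, b, gamma, p, num_iters, seed=0):
    """Scaffnew on f_i(x) = 0.5 * sum_j a[i, j] * (x[j] - b[i, j])**2.

    a, b: arrays of shape (n, d) with per-node curvatures (> 0) and targets.
    Returns the final local iterates (n, d) and the number of communication rounds.
    """
    rng = np.random.default_rng(seed)
    n, d = a.shape
    x = np.zeros((n, d))   # local models
    h = np.zeros((n, d))   # control variates; they sum to zero across nodes
    rounds = 0
    for _ in range(num_iters):
        grad = a * (x - b)                    # exact local gradients
        x_hat = x - gamma * (grad - h)        # local step corrected by the control variate
        if rng.random() < p:                  # communicate with probability p
            x = np.tile(x_hat.mean(axis=0), (n, 1))
            rounds += 1
        else:                                 # skip communication
            x = x_hat
        h = h + (p / gamma) * (x - x_hat)     # nonzero change only after communication
    return x, rounds

# Heterogeneous toy problem: the local minimizers b[i] differ across nodes.
rng = np.random.default_rng(1)
a = rng.uniform(1.0, 4.0, size=(5, 3))
b = rng.normal(size=(5, 3))
x_star = (a * b).sum(axis=0) / a.sum(axis=0)  # minimizer of the average objective
x, rounds = scaffnew(a, b, gamma=0.25, p=0.5, num_iters=400)
print(np.abs(x - x_star).max(), rounds)
```

In this noiseless toy run every node approaches the minimizer of the average objective even though the local minimizers differ, while only a fraction of the iterations communicate.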
2.3 Assumptions
We further use the following standard assumptions:
Assumption 2.
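A solution $x^\star$ exists to problem (1), and the optimal value satisfies $f^\star > -\infty$. Moreover, each $f_i$ is $L$-smooth, i.e., $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.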
Assumption 3.
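Each function $f_i$ is $\mu$-strongly convex for a constant $\mu \ge 0$, i.e., $f_i(y) \ge f_i(x) + \langle \nabla f_i(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y \in \mathbb{R}^d$.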
Assumption 4.
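For all iterations $t \ge 0$, the local stochastic gradient $g_i^t$ of $f_i$ at $x_i^t$ is an unbiased estimate, i.e., $\mathbb{E}\big[g_i^t \mid x_i^t\big] = \nabla f_i(x_i^t)$,
and there exists a constant $\sigma \ge 0$ such that $\mathbb{E}\big[\|g_i^t - \nabla f_i(x_i^t)\|^2 \mid x_i^t\big] \le \sigma^2$.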
3 Convergence results
We now present our novel convergence results for ProxSkip. In Section 3.1, we recall the existing results in [46]. In Section 3.2, the convergence rates and communication complexities for non-convex and convex functions are presented in Theorem 2 and Corollary 1, respectively. In Section 3.3, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes.
3.1 Preliminary
Theorem 1.
When , by setting and , we can deduce from (5) that the communication complexity of ProxSkip to achieve -accuracy, i.e., , is given by , where . If the network is sufficiently well-connected, i.e., , and we set , the iteration complexity becomes , achieving the optimal communication complexity as proven by [52].
When , based on (5) and the fact that , we can conclude that the local solution generated by ProxSkip converges to the global minimizer at a linear rate until it reaches an -neighborhood of . However, it is important to note that relying solely on equation (5) is not sufficient to achieve the desired linear speedup term . This indicates that the direct extension of the analysis techniques proposed in [46] or [51] to the stochastic scenario does not guarantee linear speedup, despite ensuring convergence. Therefore, further analysis is required to achieve the desired linear speedup.
3.2 Main theorem—Convergence rate of ProxSkip
We are now ready to present the new convergence results for ProxSkip.
Theorem 2.
Suppose that Assumptions 1, 2, and 4 hold. Let denote the iterates of Algorithm 1 and solves (1). For sufficiently small , , and , we have the following convergence results.
Non-convex: Let and . It holds that
(6) |
Convex: Let . Under the additional Assumption 3 with , it holds that
(7) |
Strongly convex: Under the additional Assumption 3 with , it holds that
(8) |
For the non-convex setting, Theorem 2 demonstrates that the ProxSkip algorithm converges to a neighborhood of some stationary point. Without any additional assumptions, a stationary point is the best guarantee possible and is a satisfactory criterion to measure the performance of distributed methods with non-convex objectives [35]. For the convex case, Theorem 2 shows that ProxSkip converges to a neighborhood of some optimal solution. When $\sigma = 0$, i.e., in the deterministic case, ProxSkip converges exactly with sublinear and linear rates for N-CVX/CVX and S-CVX settings, respectively.
Note that stochastic part in convergence rates (6), (7), and (8), which all can be rewritten as . It follows from Theorem 2 that
Thus, it is established in Theorem 2 that the linear speedup term can be achieved. When the stepsize is sufficiently small, the term dominates the convergence rates (6), (7), and (8), which improve linearly with the number of nodes $n$.
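The mechanism behind a linear-speedup term can be illustrated numerically: averaging $n$ independent gradient-noise samples, each with variance $\sigma^2$, yields noise with variance $\sigma^2/n$. The sketch below assumes scalar Gaussian noise and hypothetical values of `sigma` and `num_trials`; it illustrates this general fact only and does not reproduce the bounds of Theorem 2.

```python
import numpy as np

def averaged_noise_variance(n, sigma, num_trials=200_000, seed=0):
    """Empirical variance of the average of n i.i.d. N(0, sigma^2) noise samples."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(num_trials, n))
    return noise.mean(axis=1).var()

sigma = 2.0
for n in (1, 4, 16):
    # empirical variance next to the predicted value sigma^2 / n
    print(n, averaged_noise_variance(n, sigma), sigma**2 / n)
```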
Setting for sufficiently large for non-convex and convex settings, it holds that the rates are bounded by
For the strongly convex setting, letting for sufficiently large , it holds that , where denotes the exponential function, thus the rate is bounded by
When is sufficiently large, the term (or for the strongly convex setting) will dominate the rate. In this scenario, ProxSkip requires (or ) iterations to reach a desired -accurate solution, thus the convergence accuracy improves linearly with .
In addition, based on Theorem 2, we can even get a tighter rate by carefully selecting the stepsize to obtain the following result.
Corollary 1.
Under the same settings as in Theorem 2, we have the following convergence results.
Non-convex: It holds that
(9) |
expected communication rounds.
Strongly Convex: Under the additional Assumption 3 with , it holds that
(11) |
expected communication rounds. Here, the notation ignores logarithmic factors.
We provide Table 2 to compare the convergence results of ProxSkip with existing state-of-the-art decentralized optimization algorithms with local updates, such as local-DSGD [35], $K$-GT [40], and LED [45], in terms of the number of communication rounds needed to achieve .
(a) is the function heterogeneity constant such that .
(b) The result is for the non-convex setting, and no corresponding result is given for the convex setting.
Achieving acceleration by $p$ and $n$. According to (9), (10), and (11), when is sufficiently small, the convergence rate is dominated by noise and is unaffected by the graph parameter for ProxSkip. After enough transient time, ProxSkip with (or for the strongly convex setting) achieves linear speedup with respect to the probability of communication $p$ and the number of nodes $n$.
Removing dependence on data heterogeneity. According to Table 2, the second term of the communication complexity of local-DSGD [35], a popular algorithm for decentralized optimization, is as follows:
Here, represents the function heterogeneity constant such that . We note that ProxSkip lacks the additional term ( for the strongly convex case). Thus, ProxSkip effectively eliminates dependence on the data heterogeneity level .
Comparable with existing decentralized algorithms incorporating local updates. When , we have ; when , . Highlighting the network quantities, the second and third terms of the communication complexity of ProxSkip are when , when . Compared with GT-based methods [40], the network-dependent bounds are improved. Let . Considering that the first term of the communication complexity of ProxSkip, local-DSGD [35], $K$-GT [40], Periodical-GT [40], and LED [45] is (or for the strongly convex setting), where denotes the number of local steps, the convergence rates of ProxSkip are comparable with these existing decentralized algorithms incorporating local updates.
3.3 Achieving linear speedup with network-independent stepsizes
Theorem 3.
According to this rate, a linear speedup term of can be achieved. Importantly, the upper bound on the stepsize is independent of the network topology, which is a favorable property for practical implementation. Referring to Table 2, in the strongly convex setting, while local-DSGD [35], $K$-GT [40], and LED [45] achieve linear speedup bounds, this property hinges on network-dependent stepsizes, which depend on . In contrast, the stepsize condition for ProxSkip is , which remains independent of .
Notably, [53] proves for the first time that NIDS/ED/D2 can achieve linear speedup with network-independent stepsizes. However, it has remained an open question whether, with local updates, a linear speedup bound can be achieved using network-independent stepsizes. Theorem 3 answers this question affirmatively.
3.4 Proof sketch of the main theorem
The existing convergence analyses of ProxSkip [46, 47, 48, 49, 50, 51] rely on primal-dual methodologies. Nevertheless, these analyses are limited to the use of first-order (stochastic) gradient information, leading to a suboptimal exploitation of the available function data. We propose a new proof: to fully utilize function and gradient information, we use matrix factorization techniques to equivalently transform the iteration of ProxSkip into an “SGD + consensus” form.
Here, we provide a proof sketch for Theorem 2 concerning non-convex objectives.
Step 1. (Lemma 1) We first give the equivalent form of update (4) as follows.
(13) |
where , , measures “consensus”, is related to the stochastic gradient, and measures “communication error” (, if ). This description may be less rigorous, but it helps to understand the proof more clearly. See Lemma 1 in Appendix for more details.
Step 2. (Lemma 2) Based on this equivalent update of ProxSkip and by the -smoothness of , we establish the following descent inequality.
Taking average for both sides over , we have
(14) |
Step 3. (Lemma 2) Subsequently, we establish the following consensus inequality.
where and . Unrolling this recurrence, we have
Step 4. Since , letting such that , it gives that . Combining it with (14), we complete the proof, i.e.,
4 Experiments
We empirically verify the theoretical results of ProxSkip for stochastic decentralized optimization. The experiment results for the deterministic case can be found in [46].
Setup. Similarly to [46], we demonstrate our findings on the logistic regression problem with a regularizer. The objective function is . Here, is the regularizer, and each node holds its own training data , including sample vectors and corresponding classes . We use the dataset ijcnn1 from the widely-used LIBSVM library [54], whose attributes are and . Moreover, the training samples are randomly and evenly distributed over all the agents. We control the stochastic noise by adding Gaussian noise to every stochastic gradient, i.e., the stochastic gradients are generated as follows: , where and .
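A noisy local gradient of the kind described in this setup could be generated as sketched below. This is an illustration under assumptions rather than the experiment code: the local objective is taken to be the average logistic loss plus `rho` times a regularizer, the regularizer is chosen here as $\frac{1}{2}\|x\|^2$, labels are assumed to lie in $\{-1, +1\}$, and the names `A`, `b`, `rho`, and `noise_std` are hypothetical.

```python
import numpy as np

def local_loss(x, A, b, rho):
    """Average logistic loss over the rows of A plus (rho / 2) * ||x||^2."""
    margins = b * (A @ x)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * rho * (x @ x)

def noisy_local_gradient(x, A, b, rho, noise_std, rng):
    """Exact gradient of local_loss plus i.i.d. Gaussian noise of std noise_std."""
    margins = b * (A @ x)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), evaluated in a stable way
    coeff = -np.exp(-np.logaddexp(0.0, margins))
    grad = (A.T @ (coeff * b)) / A.shape[0] + rho * x
    return grad + noise_std * rng.normal(size=x.shape)

# Toy data standing in for one node's samples (random, not ijcnn1).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
b = np.where(rng.random(50) < 0.5, -1.0, 1.0)
x = rng.normal(size=4)
print(noisy_local_gradient(x, A, b, rho=0.1, noise_std=0.01, rng=rng))
```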
For all experiments, we first compute the solution to (1) by centralized methods, and then run over a randomly generated connected network with agents and undirected edges, where is the connectivity ratio. The mixing matrix is generated with the Metropolis-Hastings rule. All stochastic results are averaged over 10 runs.
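One common variant of the Metropolis-Hastings rule assigns weight $1/(1+\max(d_i, d_j))$ to each edge $(i,j)$, where $d_i$ is the degree of node $i$, and puts the remaining mass on the diagonal. The sketch below uses this variant. The graph generator (a ring backbone plus random extra edges, which guarantees connectivity) and all function names are assumptions for illustration, since the text does not specify how the random graphs are drawn beyond the connectivity ratio.

```python
import numpy as np

def random_connected_graph(n, extra_edge_prob, seed=0):
    """Boolean adjacency matrix (n >= 3): ring backbone plus random extra edges."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):                       # ring backbone keeps the graph connected
        adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = True
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < extra_edge_prob:
                adj[i, j] = adj[j, i] = True
    return adj

def metropolis_hastings_weights(adj):
    """Symmetric, doubly stochastic mixing matrix from an undirected adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()           # leftover mass stays on the node itself
    return W

W = metropolis_hastings_weights(random_connected_graph(8, extra_edge_prob=0.3))
print(np.round(W, 3))
```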
Achieving linear speedup by $p$ and $n$. We choose the regularizer to demonstrate the results in the convex setting. The results are shown in Fig. 1. The relative error is shown on the -axis. Here, we set , which is independent of the network topology. We show the performance of ProxSkip for different numbers of nodes , network connectivity , and communication probability . The results show that, when the number of nodes is increased, the relative error of ProxSkip is reduced under a constant and network-independent stepsize, which validates our results about linear speedup. Moreover, Fig. 1 shows that we can save on communication rounds by reducing , i.e., increasing the number of local steps reduces the amount of communication required to achieve the same level of accuracy.
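Because each iteration communicates independently with probability $p$, a run of $T$ iterations uses about $pT$ communication rounds in expectation, and the number of iterations between consecutive communications is geometric with mean $1/p$. The simulation below is a sketch with hypothetical values of `p` and `num_iters`; it only simulates the coin flips and does not run ProxSkip.

```python
import numpy as np

def simulate_communication(num_iters, p, seed=0):
    """Return (number of communication rounds, mean gap between consecutive rounds)."""
    rng = np.random.default_rng(seed)
    theta = rng.random(num_iters) < p        # theta[t] is True when iteration t communicates
    comm_steps = np.flatnonzero(theta)
    gaps = np.diff(comm_steps)               # iterations between consecutive communications
    return int(theta.sum()), float(gaps.mean())

for p in (0.5, 0.1, 0.02):
    rounds, mean_gap = simulate_communication(200_000, p)
    # communication rounds, empirical mean gap, and its predicted value 1/p
    print(p, rounds, mean_gap, 1.0 / p)
```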
Comparing with existing decentralized algorithms. We choose the regularizer , , and to demonstrate the results in the non-convex setting, where . In this case, we compare ProxSkip to the decentralized methods local-DSGD [35], $K$-GT [40], and LED [45] for different local steps . We use the same stepsize for all algorithms. Fig. 2 shows that ProxSkip and LED perform similarly, and that they outperform the other methods as we increase the number of local steps.
5 Conclusion
This paper revisits the convergence bounds of ProxSkip for stochastic decentralized optimization. We present a new analysis with a novel proof technique applicable to stochastic non-convex, convex, and strongly convex settings. Through this comprehensive analysis, we derive several new results that rival the bounds of state-of-the-art decentralized algorithms [35, 40, 45]. We establish that the leading communication complexity of ProxSkip is , indicating that ProxSkip can achieve acceleration by $p$ and $n$. Our proposed proof technique overcomes the analytical limitations of prior work [46, 47, 48, 49, 50, 51] and might be of independent interest in the community.
References
- [2] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Proc. Adv. Neural Inf. Process. Sys., pp. 5330–5340, 2017.
- [3] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Mach. Learn., pp. 344–353, 2019.
- [4] A. Koloskova, T. Lin, and S. Stich, “An improved analysis of gradient tracking for decentralized machine learning,” in Proc. Adv. Neural Inf. Process. Sys., pp. 11422–11435, 2021.
- [5] S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,” IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022.
- [6] L. Guo, X. Shi, S. Yang, and J. Cao, “DISA: A Dual inexact splitting algorithm for distributed convex composite optimization,” IEEE Trans. Autom. Control, doi: 10.1109/TAC.2023.3301289, 2023.
- [7] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Proc. Adv. Neural Inf. Process. Sys., 2017.
- [8] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proc. Int. Conf. Mach. Learn., pp. 560–569, 2018.
- [9] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Proc. Adv. Neural Inf. Process. Sys., 2018.
- [10] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Mach. Learn., pp. 3478–3487, 2019.
- [11] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in Proc. Int. Conf. Mach. Learn., pp. 3252–3261, 2019.
- [12] I. Fatkhullin, A. Tyurin, and P. Richtárik, “Momentum provably improves error feedback!,” in Proc. Adv. Neural Inf. Process. Sys., 2023.
- [13] A. Tyurin and P. Richtárik, “2Direction: Theoretically faster distributed training with bidirectional communication compression,” in Proc. Adv. Neural Inf. Process. Sys., 2023.
- [14] A. Sadiev, D. Kovalev, and P. Richtárik, “Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with inexact Prox,” in Proc. Adv. Neural Inf. Process. Sys., pp. 21777–21791, 2022.
- [15] D. Kovalev, A. Salim, and P. Richtárik, “Optimal and practical algorithms for smooth and strongly convex decentralized optimization,” in Proc. Adv. Neural Inf. Process. Sys., pp. 18342–18352, 2020.
- [16] H. Li, C. Fang, W. Yin and Z. Lin, “Decentralized accelerated gradient methods with increasing penalty parameters,” IEEE Trans. Signal Process., vol. 68, pp. 4855–4870, 2020.
- [17] H. Li, Z. Lin, and Y. Fang, “Variance reduced EXTRA and DIGing and their optimal acceleration for strongly convex decentralized optimization,” J. Mach. Learn. Res., vol. 23, 2022.
- [18] H. Hendrikx, F. Bach, and L. Massoulié, “An optimal algorithm for decentralized finite-sum optimization,” SIAM J. Optim., vol. 31, no. 4, pp. 2753–2783, 2021.
- [19] Z. Song, L. Shi, S. Pu, and M. Yan, “Optimal gradient tracking for decentralized optimization,” Math. Program., 2023. doi: 10.1007/s10107-023-01997-7.
- [20] T. Lin, S. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” in Proc. Int. Conf. Learn. Represent., 2018, arXiv:1808.07217. [Online]. Available: https://arxiv.longhoe.net/abs/1808.07217.
- [21] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir, and N. Srebro, “Is Local SGD Better than Minibatch SGD?,” in Proc. Int. Conf. Mach. Learn., pp. 10334–10343, 2020.
- [22] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A.T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” 2016, arXiv:1610.05492. [Online]. Available: https://arxiv.longhoe.net/abs/1610.05492.
- [23] S. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learn. Represent., 2018, arXiv:1805.09767. https://arxiv.longhoe.net/abs/1805.09767.
- [24] H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-IID federated learning,” in Proc. Int. Conf. Learn. Represent., 2021.
- [25] A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 4519–4529, 2020.
- [26] J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of local update SGD algorithms,” J. Mach. Learn. Res., vol. 22, no. 213, 2021.
- [27] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proc. Int. Conf. Mach. Learn., pp. 5132–5143, 2020.
- [28] Z. Cheng, X. Huang, and K. Yuan, “Momentum benefits non-IID federated learning simply and provably,” in Proc. Int. Conf. Learn. Represent., 2024.
- [29] A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients,” in Proc. Adv. Neural Inf. Process. Sys., pp. 14606–14619, 2021.
- [30] X. Zhang, M. Hong, S. Dhople, W. Yin, and Y. Liu, “FedPD: A federated learning framework with adaptivity to Non-IID data,” IEEE Trans. Signal Process., vol. 69, pp. 6055–6070, 2021.
- [31] D. A. E. Acar, Y. Zhao, R. Matas Navarro, M. Mattina, P. N. Whatmough, and V. Saligrama, “Federated learning based on dynamic regularization,” in Proc. Int. Conf. Learn. Represent., 2021.
- [32] X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y. Cheng, “Variance reduced local SGD with lower communication complexity,” 2019, arXiv:1912.12844. https://arxiv.longhoe.net/abs/1912.12844.
- [33] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi, “Federated learning with compression: Unified analysis and sharp guarantees,” in Proc. Int. Conf. Artif. Intell. Statist., pp. 2350–2358, 2021.
- [34] X. Huang, P. Li, and X. Li, “Stochastic controlled averaging for federated learning with communication compression,” in Proc. Int. Conf. Learn. Represent., 2024.
- [35] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A unified theory of decentralized SGD with changing topology and local updates,” in Proc. Int. Conf. Mach. Learn., pp. 5381–5393, 2020.
- [36] S. Pu and A. Nedić, “Distributed stochastic gradient tracking methods,” Math. Program., vol. 187, pp. 409–457, 2021.
- [37] A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM J. Optim., vol. 27, no. 4, pp. 2597–2633, 2017.
- [38] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1245–1260, Sep. 2018.
- [39] E. D. H. Nguyen, S. A. Alghunaim, K. Yuan, and C. A. Uribe, “On the performance of gradient tracking with local updates,” 2022, arXiv:2210.04757, [Online]. Available: https://arxiv.longhoe.net/abs/2210.04757.
- [40] Y. Liu, T. Lin, A. Koloskova, and S. U. Stich, “Decentralized gradient tracking with local steps,” 2023, arXiv:2301.01313, [Online]. Available: https://arxiv.longhoe.net/abs/2301.01313.
- [41] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning-part I: Algorithm development,” IEEE Trans. Signal Process., vol. 67, no. 3, pp. 708–723, Feb. 2019.
- [42] Z. Li, W. Shi, and M. Yan, “A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates,” IEEE Trans. Signal Process., vol. 67, no. 17, pp. 4494–4506, Sep. 2019.
- [43] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D2: Decentralized training over decentralized data,” in Proc. Int. Conf. Mach. Learn., pp. 4848–4856, 2018.
- [44] L. Guo, X. Shi, J. Cao, and Z. Wang, “Decentralized inexact proximal gradient method with network-independent stepsizes for convex composite optimization,” IEEE Trans. Signal Process., vol. 71, pp. 786–801, 2023.
- [45] S. A. Alghunaim, “Local exact-diffusion for decentralized optimization and learning,” 2023, arXiv:2302.00620, [Online]. Available: https://arxiv.longhoe.net/abs/2302.00620.
- [46] K. Mishchenko, G. Malinovsky, S. Stich, and P. Richtárik, “ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!,” in Proc. Int. Conf. Mach. Learn., pp. 15750–15769, 2022.
- [47] L. Condat, I. Agarský, G. Malinovsky, and P. Richtárik, “TAMUNA: Doubly accelerated federated learning with local training, compression, and partial participation,” 2023, arXiv:2302.09832, [Online]. Available: https://arxiv.longhoe.net/abs/2302.09832.
- [48] L. Condat, I. Agarský, and P. Richtárik, “Provably doubly accelerated federated learning: The first theoretically successful combination of local training and communication compression,” 2023, arXiv:2210.13277, [Online]. Available: https://arxiv.longhoe.net/abs/2210.13277.
- [49] G. Malinovsky, K. Yi, and P. Richtárik, “Variance reduced ProxSkip: Algorithm, theory and application to federated learning,” in Proc. Adv. Neural Inf. Process. Sys., pp. 15176–15189, 2022.
- [50] Z. Hu and H. Huang, “Tighter analysis for ProxSkip,” in Proc. Int. Conf. Mach. Learn., pp. 13469–13496, 2023.
- [51] L. Condat and P. Richtárik, “RandProx: Primal-dual optimization algorithms with randomized proximal updates,” in Proc. Int. Conf. Learn. Represent., 2023.
- [52] K. Scaman, F. Bach, S. Bubeck, Y.-T. Lee, and L. Massoulié, “Optimal algorithms for smooth and strongly convex distributed optimization in networks,” in Proc. Int. Conf. Mach. Learn., pp. 3027–3036, 2017.
- [53] H. Yuan, S. A. Alghunaim, and K. Yuan, “Achieving linear speedup with network-independent learning rates in decentralized stochastic optimization,” in Proc. IEEE Conf. Decis. Control, pp. 139–144, 2023.
- [54] C.-C. Chang and C.-J. Lin, “LibSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, 2011, Art. no. 27.
Appendix
Appendix A Preliminaries
A.1 Basic Facts
The stochastic processes such as randomized communication and gradient estimation generate two sequences of $\sigma$-algebras. We denote by the $\sigma$-algebra of gradient estimation at the $t$-th iteration and the $\sigma$-algebra of randomized communication at the same step. The sequences and satisfy
With these notations, we can clarify the stochastic dependencies among the variables generated by Algorithm 1, i.e., is measurable in and is measurable in .
The Bregman divergence of $f$ at points $(x, y)$ is defined by $D_f(x, y) := f(x) - f(y) - \langle \nabla f(y), x - y\rangle$.
It is easy to verify that . If is convex, from the definition of convex function, we have and . Thus
(15) |
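Two basic Bregman-divergence facts can be checked numerically. Assuming the standard definition $D_f(x,y) = f(x) - f(y) - \langle \nabla f(y), x - y\rangle$, the sketch below verifies on a convex test function (log-sum-exp plus a quadratic, a hypothetical choice) that $D_f$ is nonnegative and that $D_f(x,y) + D_f(y,x) = \langle \nabla f(x) - \nabla f(y), x - y\rangle$.

```python
import numpy as np

def f(x):
    """Convex test function: log-sum-exp plus a quadratic."""
    return np.logaddexp.reduce(x) + 0.5 * (x @ x)

def grad_f(x):
    e = np.exp(x - x.max())          # softmax, shifted for numerical stability
    return e / e.sum() + x

def bregman(x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
lhs = bregman(x, y) + bregman(y, x)
rhs = (grad_f(x) - grad_f(y)) @ (x - y)
print(bregman(x, y), bregman(y, x), lhs - rhs)
```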
For an $L$-smooth and $\mu$-strongly convex function $f$, by [46, Appendix A] we have
(16) | ||||
(17) |
Under the $L$-smoothness condition, we have
(18) |
A.2 Notations
For any matrices and , their inner product is denoted as . For a given matrix , the Frobenius norm is given by , while the spectral norm is given by . Define the gradient and communication noise as
gradient noise: | |||
communication noise: |
We also define the following notations to simplify the analysis:
With Assumption 1 and [5, Section IV-B], the mixing matrix can be decomposed as
where , and matrix satisfies
Therefore, the matrix and can be decomposed as
(29) |
where , . Since for , it holds that and for .
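The eigendecomposition of the mixing matrix can be illustrated numerically. The sketch below assumes, as is common in this literature, that Assumption 1 requires a symmetric doubly stochastic mixing matrix over a connected graph; the ring topology, the self-weight 0.5, and the helper `ring_mixing_matrix` are hypothetical choices rather than the setup used in the experiments.

```python
import numpy as np

def ring_mixing_matrix(n, self_weight=0.5):
    """Symmetric doubly stochastic mixing matrix for a ring of n >= 3 nodes."""
    W = np.zeros((n, n))
    off = (1.0 - self_weight) / 2.0      # weight on each of the two neighbors
    for i in range(n):
        W[i, i] = self_weight
        W[i, (i + 1) % n] = off
        W[i, (i - 1) % n] = off
    return W

n = 8
W = ring_mixing_matrix(n)

# eigh returns ascending eigenvalues and an orthonormal eigenbasis for symmetric W.
eigvals, Q = np.linalg.eigh(W)
W_rebuilt = Q @ np.diag(eigvals) @ Q.T
top_vec = Q[:, -1]                       # eigenvector of the largest eigenvalue

print(np.allclose(W_rebuilt, W), eigvals[-1], eigvals[-2])
```

In this example the largest eigenvalue equals one with an eigenvector proportional to the all-ones vector, and every other eigenvalue is strictly smaller than one, which is the spectral structure the decomposition relies on.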
Appendix B Proof of Theorem 2 and Corollary 1
B.1 Transformation and Some Descent Inequalities
Here, we introduce an auxiliary variable , where . It follows from (4b) and (4c) that, when and , . For any fixed point of update (4), it holds that , , . Thus, implies that , i.e., is a stationary point of problem (1). Using this new variable, we give the following error dynamics of Algorithm 1.
Lemma 1.
Suppose that Assumption 1 holds. There exists an invertible matrix and a diagonal matrix such that
(30a) | ||||
(30f) |
where ,
Moreover, we have
In addition, we have
(31) |
Proof.
See Appendix D. ∎
Based on Lemma 1, we give the following descent inequalities.
Lemma 2.
Proof.
See Appendix E. ∎
B.2 Convergence Analysis: Non-convex
Theorem 4.
Proof.
See Appendix F. ∎
Based on Theorem 2, we can even get a tighter rate by carefully selecting the stepsize similar to [35, Lemma 17], [40, Lemma C.13], or [45, Corollary 1].
Corollary 2.
Proof.
See Appendix G. ∎
When , we have . Choose . Since in each iteration we trigger communication with probability , for any desired accuracy , the expected number of communication rounds required to achieve is bounded by:
When , we have . If we choose such that , then for any desired accuracy , the expected communication complexity is bounded by
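The conversion from iteration complexity to communication complexity used above rests on one fact: if communication is triggered independently with probability p in each iteration, then T iterations incur pT communication rounds in expectation. A minimal simulation of this fact, with hypothetical values of `T`, `p`, and `trials`, is sketched below.

```python
import numpy as np

def count_communications(T, p, rng):
    """Count the iterations (out of T) in which communication is triggered."""
    flags = rng.random(T) < p            # independent Bernoulli(p) coin flips
    return int(flags.sum())

T, p, trials = 10_000, 0.2, 200          # hypothetical values for illustration
rng = np.random.default_rng(0)
counts = np.array([count_communications(T, p, rng) for _ in range(trials)])
mean_rounds = counts.mean()

print(mean_rounds, p * T)                # empirical mean vs. expected value p * T
```

The empirical mean concentrates around p * T, so an iteration bound translates into an expected communication bound after multiplication by p.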
B.3 Convergence Analysis: Convex
Theorem 5.
Proof.
See Appendix H. ∎
Similar to the analysis of the non-convex setting, with Theorem 5, we have the following results.
Corollary 3.
Proof.
See Appendix I. ∎
Similar to the analysis of the non-convex setting, when and choosing , for any desired accuracy , the expected communication complexity is bounded by
When , we have . If we choose such that , then for any desired accuracy , the expected communication complexity is bounded by
B.4 Convergence Analysis: Strongly Convex
Theorem 6.
Proof.
See Appendix J. ∎
Based on Theorem 6, we can even get a tighter rate by carefully selecting the step size similar to [35].
Corollary 4.
Proof.
See Appendix K. ∎
Similar to the analysis of the non-convex and convex settings, we have if and if . Thus, for any desired accuracy , the expected number of communication rounds required to achieve is bounded by
and
Appendix C Proof of Theorem 3
Next, we prove that ProxSkip can achieve linear speedup with network-independent stepsizes. We introduce new iterates to facilitate the analysis. Similar techniques can be found, e.g., in [41, 42, 6]. Since , from (4b) and (4c), we have
Therefore, letting , we have the following equivalent form of ProxSkip (4) in the sense that they generate an identical sequence .
(45a) | ||||
(45b) | ||||
(45c) |
This equivalent form is more useful for the subsequent convergence analysis. The optimality condition of problem (1) is given in the following lemma.
Lemma 3.
From Lemma 3, when , we have that any fixed point of (45) satisfies the condition (46). We also define the following notations to simplify the analysis:
where satisfies (46). Similar to Lemma 1, we give another error dynamics of ProxSkip, which will be used to prove the linear speedup of ProxSkip with network-independent stepsizes under strong convexity.
Lemma 4.
Suppose that Assumption 1 holds. If and , there exist an invertible matrix and a diagonal matrix such that
(47a) | ||||
(47f) |
where is an arbitrary strictly positive constant,
Moreover, we have .
Proof.
See Appendix L. ∎
With this error dynamics, similar to Lemma 2, we give the following descent inequalities.
Lemma 5.
Proof.
See Appendix M. ∎
Theorem 7.
Proof.
See Appendix N. ∎
Appendix D Proof of Lemma 1
Proof.
It follows from (4b), and that
Since , it follows from (4b), (4c), and that
By , , and , we have
Note that . Algorithm 1 (update (4)) is equivalent to
which also can be rewritten as (since )
Multiplying both sides of the above by on the left, and using (29) and
we have
Let
where , and . Since the blocks of are diagonal matrices, there exists a permutation matrix such that , where
Setting , we have and can be rewritten as
It holds that , . Thus, the eigenvalues of are
Notice that when , which holds under Assumption 1 since , i.e., . For , the eigenvalues of are complex and distinct:
where . Through algebraic multiplication it can be verified that , where and
Note that
Since the spectral radius of a matrix is upper bounded by any of its norms and , it holds that
Using and , we have
Let with . We have , where , i.e., there exists an invertible matrix such that , and
Therefore, we finally obtain (30). Moreover, we have
Then, we prove . Since
taking the squared norm, we have
On the other hand, noting that
it holds that
where and are the upper and lower blocks of . Then, it holds that
where we used since is a permutation matrix . ∎
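Two linear-algebra facts used in this proof can be checked numerically: the spectral radius of a matrix is bounded by any of its induced norms (and by the Frobenius norm), and a permutation matrix has spectral norm one. The sketch below uses a random matrix `A` and a random permutation matrix `P`, both hypothetical stand-ins that are unrelated to the specific matrices in the proof.

```python
import numpy as np

def spectral_radius(A):
    """Largest modulus among the eigenvalues of A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))              # hypothetical test matrix
rho = spectral_radius(A)

norms = {
    "1-norm": np.linalg.norm(A, 1),          # max absolute column sum
    "2-norm": np.linalg.norm(A, 2),          # largest singular value
    "inf-norm": np.linalg.norm(A, np.inf),   # max absolute row sum
    "Frobenius": np.linalg.norm(A, "fro"),
}

P = np.eye(6)[rng.permutation(6)]        # random permutation matrix
perm_norm = np.linalg.norm(P, 2)

print(rho, norms, perm_norm)
```

Because a permutation matrix is orthogonal, multiplying by it also leaves the Frobenius norm unchanged, which is why such factors can be dropped when norms are taken.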
Appendix E Proof of Lemma 2
Proof.
Proof of the descent inequality (32). Since is -smooth, setting and in (18), it gives that
From (30a), i.e.,
where , we have
Taking the conditional expectation with respect to and by
it holds that
Since , we have
Combining the last two equations and by , we get
By (31), i.e., , we have
Thus, the descent inequality (32) holds.
Proof of the inequality (33). Taking the conditional expectation with respect to , from (30f), we have
Since , , and , we have
Hence, it gives that
Taking the conditional expectation with respect to , and using the unbiasedness of , we have
(52) |
We first bound . Recall the definition of .
Note that and . It follows that
where is a permutation matrix . Therefore, we have
Then, using the Cauchy-Schwarz inequality, , , and , we have
For any matrices and , it holds from Jensen’s inequality that for any . Therefore, letting , it holds that
Since , we have
Note that
It follows from the above inequality that
(53) |
can be bounded as follows: Note that
It follows that
By and -smoothness of , we have
(54) |
Since , it holds that
(55) |
Since , , and , it gives that
Then, substituting it into (55), we have
(56) |
Thus, combining (E) and (56), it holds that
(57) |
Then, we bound . Using , , and , we have
(58) |
Combining (52) and (E), we have
Therefore, combining it and (E), the inequality (33) follows.
Proof of the inequality (2). Let . It follows from (30a), i.e., , Assumption 4, and that
(59) |
It follows from the -smoothness of and and Jensen’s inequality that
(60) |
Then, we consider the bound of . Since is -smooth and -strongly convex, and , by (16), it gives that
(61) |
Substituting (E) and (E) into (E), and using , we have
(62) |
Since , it holds that
Combining with , we complete the proof. ∎
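The proof of (33) invokes a Jensen-type splitting of a squared norm. We assume it takes the standard form ||A + B||^2 <= (1/θ)||A||^2 + (1/(1-θ))||B||^2 for any θ in (0, 1), which follows from convexity of the squared Frobenius norm; the exact statement used in the proof may differ. The sketch below checks this assumed form on random matrices, with `jensen_bound` a hypothetical helper name.

```python
import numpy as np

def jensen_bound(A, B, theta):
    """Right-hand side (1/theta)||A||_F^2 + (1/(1 - theta))||B||_F^2."""
    return np.linalg.norm(A) ** 2 / theta + np.linalg.norm(B) ** 2 / (1.0 - theta)

rng = np.random.default_rng(2)
A, B = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))

lhs = np.linalg.norm(A + B) ** 2         # Frobenius norm is the default for 2-D arrays
thetas = np.linspace(0.05, 0.95, 19)
bounds = np.array([jensen_bound(A, B, t) for t in thetas])

print(lhs, bounds.min())
```

The free parameter θ is what lets a proof of this type trade a contraction factor on one term against a larger constant on the other.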
Appendix F Proof of Theorem 4
Proof.
From the condition of stepsize, we have
Then, it follows from the definition of (34) that
To ensure , we need to choose and such that
By solving these inequalities
and using , we have
Thus, it implies that if the condition of and in this theorem holds, then .
Define the Lyapunov function
Since and , we have
Thus, it gives that
Then, since , by (32) and (33), it gives
where the last inequality holds because the condition (36) implies . Taking full expectation, we have
Summing it over , we can obtain
which implies that
Since , by [45, (75)], we have . Notice that . It holds that
(63) |
Using (63) and
we have
Since , we have , it holds that
i.e., (4) holds. ∎
Appendix G Proof of Corollary 2
Proof.
We derive a tighter rate by carefully selecting the step size similar to [35]. We rewrite (4) as
(64) |
where
(65) |
From the condition of stepsize, we have
Setting
we have the following cases.
- When and is smaller than both and , then
- When , then
- When , then
Combining the above three cases, it holds that
Appendix H Proof of Theorem 5
Proof.
Plugging into (33) gives
(67) |
Similar to Lemma 4, we know that
Define the Lyapunov function
Since and , we have
It gives that
Thus, according to (2), (67), and , we have
Taking full expectation, we have
(68) |
Summing the inequality (68) over , we can obtain
which implies that
(69) |
Since , similar to (63), we have
(70) |
Substituting (70) into (69) and using
we can derive that
Since , we have , it holds that
i.e., (5) holds. ∎
Appendix I Proof of Corollary 3
Proof.
Then, we derive a tighter rate by carefully selecting the step size similar to Corollary 2. From the condition of stepsize, we have
Similar to the proof of Theorem 4, it follows that
where
Then, the following rate can be obtained by following the same arguments used for the non-convex case,
i.e., (41) holds. ∎
Appendix J Proof of Theorem 6
Proof.
Recall (33)
and
where the last inequality follows from . Similar to Lemma 4, we know that
Since and , we have . Thus, it holds that
Then, it follows that
(79) |
Note that
Since , we can iterate inequality (79) to get
Taking the 1-induced-norm and using properties of the (induced) norms, it holds that
(80) |
where . We now bound the last term by noting that
where the last step holds for . Therefore,
Substituting the above into (80) and using and , we obtain
Since , by [45, (75)], we have . Note that . It holds that
Thus, we finally obtain (6). ∎
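The unrolling step in this proof can be illustrated on a generic linear recursion z_{k+1} = A z_k + b, whose K-step solution is A^K z_0 plus the sum of A^l b over l = 0, ..., K-1. The proof works with an inequality version of such a recursion; the sketch below uses equality and a hypothetical 2x2 nonnegative matrix `A` with induced 1-norm below one, and then checks the geometric-series bound obtained after taking the induced 1-norm.

```python
import numpy as np

def iterate(A, b, z0, K):
    """Run z <- A z + b for K steps."""
    z = z0.copy()
    for _ in range(K):
        z = A @ z + b
    return z

def unrolled(A, b, z0, K):
    """Closed form A^K z0 + sum_{l=0}^{K-1} A^l b."""
    acc = np.zeros_like(b)
    power = np.eye(A.shape[0])
    for _ in range(K):
        acc = acc + power @ b
        power = power @ A
    return power @ z0 + acc

A = np.array([[0.6, 0.1], [0.2, 0.5]])   # hypothetical matrix, induced 1-norm 0.8
b = np.array([0.3, 0.1])
z0 = np.array([1.0, 2.0])
K = 50

z_iter = iterate(A, b, z0, K)
z_unrolled = unrolled(A, b, z0, K)

norm_A = np.linalg.norm(A, 1)
# Submultiplicativity plus the geometric series sum_l norm_A^l <= 1 / (1 - norm_A).
bound = norm_A ** K * np.linalg.norm(z0, 1) + np.linalg.norm(b, 1) / (1.0 - norm_A)

print(np.allclose(z_iter, z_unrolled), np.linalg.norm(z_iter, 1), bound)
```

The first term of the bound decays geometrically in K while the second stays constant, mirroring the structure of a linear rate plus a noise-dependent residual.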
Appendix K Proof of Corollary 4
Appendix L Proof of Lemma 4
Proof.
Note that ProxSkip (45) has the following equivalent updates
(82a) | ||||
(82b) | ||||
(82c) |
We rewrite the recursion (82) into the following matrix representation:
Multiplying both sides of the above by on the left and using (29), we have
Since lies in the range space of , we have . By the structure of , we have
Therefore, it holds that
Let
where , and . Since the blocks of are diagonal matrices, there exists a permutation matrix such that , where
Setting , we have and can be rewritten as
Since
the eigenvalues of are
Consider the sign of . Note that is a quadratic function of , and
We have
Since and , it holds that
As a result, when , we have , i.e., . It implies that
where . Since , there exists an invertible such that , where . Using [5, Appendix B.2] and letting , we have
Since the spectral radius of a matrix is upper bounded by any of its norms, , and , it holds that
Following a similar argument for , and using , we have
Let with . We have , where , i.e., there exists an invertible matrix such that , and
Moreover, we have . We thus complete the proof. ∎
Appendix M Proof of Lemma 5
Proof.
Proof of (48). It follows from (E) and that
Note that . We obtain
On the other hand, . It holds that
Therefore, (48) follows.
Proof of (49). Taking the conditional expectation with respect to , it follows from (47f) that
Since , , and , we have
Hence, it gives that
Taking the conditional expectation with respect to , and using the unbiasedness of , we have
(83) |
Let . can be bounded as follows:
The last inequality holds due to , , and . For any vectors and , it holds from Jensen’s inequality that for any . Therefore, letting , it holds that
Then, we have
(84) |
In addition, we bound as follows:
(85) |
Therefore, substituting (84) and (M) into (83), we can conclude (49). ∎
Appendix N Proof of Theorem 7
Proof.
From [51, eq. (27)], we have
(86) |
Then, recalling the definition of and , it gives that
Taking the conditional expectation with respect to , and using the unbiasedness of , we have
(87) |
By [51, Lemma 1], it gives that when and
(88) |
and . Combining with (87), it gives that
(89) |
Then, it follows from (N) and (89) that
Since , and , we have . It follows from that
Taking full expectation, and unrolling the recurrence, we have
(90) |
Note that
Since , we have
From (90), we have . Substituting it into (49), we get
(91) |
where and . Unrolling the recurrence (91), we have
(92) |
Since and , we have
Multiplying (46a) by and using (29), we have
Then, it holds that
(93) |
Combining (N) and (93), and using , it gives that
(94) |
Note that
We have , where . Substituting (94) into (48), taking full expectation, and unrolling the recurrence, we have
Note that . We have . Since and , where
we have
The linear speedup result (51) is thus proved. ∎