On the Convergence of Multi-objective Optimization under Generalized Smoothness
Abstract
Multi-objective optimization (MOO) is receiving increasing attention in fields such as multi-task learning. Recent works provide effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which are typically not satisfied by neural networks such as long short-term memory (LSTM) models and transformers. In this paper, we study a more general and realistic class of $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We develop two novel single-loop algorithms for $\ell$-smooth MOO problems, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant, Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of both algorithms and show that they converge to an $\epsilon$-accurate Pareto stationary point with a guaranteed $\epsilon$-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where in total $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples are needed in the deterministic and stochastic settings, respectively. Our algorithms can also guarantee a tighter $\epsilon$-level CA distance in each iteration using more samples. Moreover, we propose an efficient variant of GSMGrad named GSMGrad-FA that uses only constant-level time and space while achieving the same performance guarantee as GSMGrad. Our experiments validate our theory and demonstrate the effectiveness of the proposed methods.
1 Introduction
There have been a variety of emerging applications of multi-objective optimization (MOO), such as online advertising [26], autonomous driving [18], and reinforcement learning [31]. Mathematically, the MOO problem takes the following formulation.
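$$\min_{\theta\in\mathbb{R}^m} F(\theta) := \big(f_1(\theta), f_2(\theta), \ldots, f_K(\theta)\big) \tag{1}$$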
where $K$ is the total number of objectives and $f_i(\theta)$ is the $i$-th objective function given model parameters $\theta$. Under the stochastic setting, $f_i(\theta)=\mathbb{E}_{\xi}[f_i(\theta;\xi)]$, where $\xi$ denotes data samples. This problem is challenging due to gradient conflict: objectives with larger gradients dominate the update direction, at the cost of significant performance degradation on the less fortunate objectives with smaller gradients. A variety of MOO-based methods have been proposed to mitigate this conflict and find a more balanced solution among all objectives. In particular, the multiple gradient descent algorithm (MGDA) [12] aims to find a conflict-avoidant (CA) update direction that maximizes the minimal improvement among all objectives and converges to a Pareto stationary point, at which there is no common descent direction for all objective functions. This idea then inspired numerous follow-up methods, including but not limited to CAGrad [24], PCGrad [34], GradDrop [8], FAMO [23] and FairGrad [3], with convergence guarantees in the deterministic setting with full-gradient computations. The theoretical understanding of the convergence and complexity of stochastic MOO was not well developed until very recently. [25] proposed stochastic multi-gradient (SMG) as a stochastic version of MGDA and established its convergence guarantee. [36] analyzed the non-convergence issues of MGDA, CAGrad and PCGrad in the stochastic setting, and further proposed a convergent approach named CR-MOGM. More recently, [13] and [6] proposed single-loop stochastic MOO methods named MoCo and MoDo, and proved their convergence to an $\epsilon$-accurate Pareto stationary point while guaranteeing an $\epsilon$-level average CA distance over all iterations (the CA distance is the distance between the updating direction and the CA direction; its formal definition can be found in Section 2.4). [32] proposed a double-loop algorithm named SDMGrad that obtains an unbiased stochastic multi-gradient via a double-sampling strategy. They established the convergence of SDMGrad with a guaranteed $\epsilon$-level CA distance in every iteration, which we call the iteration-wise CA distance.
However, all existing works are limited by the standard $L$-smooth and bounded-gradient assumptions. A recent study [35] indicates that such assumptions may not hold in the training of neural networks, and it observed and studied an alternative $(L_0,L_1)$-smoothness condition, which assumes the Lipschitz constant to be linear in the gradient norm and allows the gradient norm to be unbounded. The analysis of existing MOO methods cannot be generalized to this $(L_0,L_1)$-smoothness directly due to the possibly unbounded smoothness or gradient norm. In addition, all existing works [29, 35, 22, 21, 19, 11, 9] on generalized smoothness are limited to single-task problems, which are fundamentally different from MOO problems. None of the above methods can be directly generalized to the MOO tasks studied in this paper since, even though each single task is generalized smooth, a linear combination of these tasks is not necessarily generalized smooth. In this paper, we aim to fill this gap by proposing novel MOO algorithms, which not only converge under the generalized smoothness condition but also mitigate gradient conflict effectively with a guaranteed sufficiently small CA distance.
1.1 Our Contributions
We propose two single-loop MOO methods, Generalized Smooth Multi-objective Gradient descent (GSMGrad) and its stochastic variant SGSMGrad, and provide them with a comprehensive convergence analysis under the generalized smoothness condition in different settings. Our detailed contributions are listed below.
Weakest assumptions in MOO. In this paper, we investigate the $\ell$-smooth assumption, where $\ell$ is a general non-decreasing function of the gradient norm, which includes both the standard $L$-smooth and $(L_0,L_1)$-smooth assumptions as special cases. This assumption finds many applications, such as LSTM models [35], transformers [11], distributionally robust optimization [19] and higher-order polynomial functions [9]. In addition, we do not make any bounded-gradient assumption, which is required in previous analyses to ensure a bounded multi-gradient approximation. To the best of our knowledge, this is the first work to investigate generalized smoothness in MOO problems.
New single-loop algorithms. Both GSMGrad and SGSMGrad are easy to implement by updating the weights of objectives and model parameters simultaneously via a single-loop structure. A warm-start initialization sub-procedure is also introduced at the beginning of both algorithms to ensure a sufficiently small iteration-wise CA distance under the single-loop structure. We also propose a computation- and memory-efficient variant of GSMGrad named GSMGrad-FA, which updates the objective weights using only forward passes of $F(\theta)$ rather than the gradient $\nabla F(\theta)$. This effectively reduces time and space to $\mathcal{O}(1)$ without hurting the performance guarantee.
Convergence analysis and optimal complexity. We provide a comprehensive analysis of our proposed algorithms under the generalized $\ell$-smooth condition in both deterministic and stochastic settings. To achieve an $\epsilon$-accurate Pareto stationary point and an $\epsilon$-level average CA distance, we show that GSMGrad and SGSMGrad require $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-4})$ samples in the deterministic and stochastic settings, respectively. Furthermore, to achieve a more aggressive $\epsilon$-level iteration-wise CA distance, GSMGrad and SGSMGrad require an increased number of samples, on the order of and respectively, in the deterministic and stochastic settings, due to smaller step sizes and mini-batch data sampling. Typically, achieving an $\epsilon$-level iteration-wise CA distance results in much higher sample complexity, such as in [13], in [6] and in [32] for the non-convex stochastic setting. Moreover, we show that GSMGrad-FA achieves the same performance guarantee as GSMGrad.
Supportive experiments. Our experiments on the MTL benchmark Cityscapes [10] validate our theory and demonstrate the effectiveness of our proposed algorithms.
1.2 Related Works
Gradient-based multi-objective optimization. A variety of gradient manipulation techniques have emerged for simultaneous learning of multiple tasks. One prevalent category of methods adjusts the weights of various objectives according to factors such as uncertainty [20], gradient norm [7], and training complexity [17]. Methods based on MOO have garnered increased attention due to their systematic designs, enhanced training stability and model-agnostic nature. For instance, [30] framed multi-task learning (MTL) as a MOO problem and introduced an optimization method akin to MGDA [12]. Afterward, many MGDA-based methods have been proposed to mitigate gradient conflict with promising empirical performance. Among them, PCGrad [34] avoids conflict by projecting the gradient of each task onto the normal plane of the other tasks' gradients. GradDrop [8] randomly drops out conflicting gradients. CAGrad [24] constrains the update direction to be close to the average gradient. NashMTL [27] and FairGrad [3] formulated MTL as a bargaining game and a resource allocation problem, respectively. Theoretically, [13] proposed a provably convergent stochastic MOO method named MoCo based on an auxiliary tracking variable for gradient approximation. [6] characterized the trade-off among optimization, generalization, and conflict avoidance in MOO. [32] proposed a stochastic MOO method named SDMGrad with a preference-oriented regularizer, and analyzed its convergence. However, all these works rely on the $L$-smoothness and bounded-gradient assumptions. The details can be found in Table 1. This paper focuses on MOO problems with generalized $\ell$-smooth objectives.
Method | Smoothness | Assumption | Sample Complexity |
SMG [25] | (LS) | (BG) | N/A |
CR-MOGM[36] | (LS) | (BF), (BG) | |
MoCo[13] | (LS) | (BF), (BG) | |
MoDo[6] | (LS) | (BG) | |
SDMGrad[32] | (LS) | (BG) | |
SGSMGrad (this paper) | (GS) | N/A |
Generalized smoothness. The generalized $(L_0,L_1)$-smoothness was first proposed by [35], based on observations from extensive empirical experiments in training neural networks. A clipping algorithm was developed by [35] and its convergence rate was provided. Later, [19] analyzed the convergence of a normalized momentum method. The SPIDER algorithm was also applied to solve generalized smooth problems in [29, 9], where [9] studied a new notion of $\alpha$-symmetric generalized smoothness, which includes $(L_0,L_1)$-smoothness as a special case. Very recently, a new $\ell$-smoothness condition was studied in [21, 22], which is the weakest smoothness condition and includes all the smoothness conditions discussed above. However, all the existing works on generalized smoothness are limited to single-task optimization, and the understanding of MOO is insufficient. This paper provides the first study of MOO under the generalized $\ell$-smoothness condition.
2 Preliminaries
2.1 Generalized smoothness
The standard $L$-smoothness condition is widely investigated in existing optimization studies [15, 16], which assumes a function $f$ to be $L$-smooth if there exists a bounded constant $L>0$ such that for any $\theta,\theta'\in\mathbb{R}^m$, $\|\nabla f(\theta)-\nabla f(\theta')\|\le L\|\theta-\theta'\|$. Nevertheless, recent studies show that in settings such as the training of LSTM models [35] and transformers [11], distributionally robust optimization [19], and high-order polynomial functions [9], the standard $L$-smoothness assumption does not hold. Instead, a generalized $(L_0,L_1)$-smoothness assumption was observed and studied in the training of LSTM models in [35], which assumes that for any $\theta\in\mathbb{R}^m$, $\|\nabla^2 f(\theta)\|\le L_0+L_1\|\nabla f(\theta)\|$. This assumption implies that the Lipschitz constant is potentially unbounded, and it reduces to $L$-smoothness if $L_1=0$. Later, a more general assumption was proposed and studied in [21]:
Definition 1.
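($\ell$-smoothness, Definition 1 in [21]). A real-valued differentiable function $f:\mathbb{R}^m\to\mathbb{R}$ is $\ell$-smooth if $\|\nabla^2 f(\theta)\|\le\ell(\|\nabla f(\theta)\|)$ almost everywhere in $\mathbb{R}^m$, where $\ell:[0,+\infty)\to(0,+\infty)$ is a continuous non-decreasing function.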
The $(L_0,L_1)$-smoothness is a special case of $\ell$-smoothness, where $\ell(u)=L_0+L_1u$. There is another definition of generalized smoothness, which is widely used and is equivalent to $\ell$-smoothness:
Definition 2.
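($(r,\ell)$-smoothness, Definition 2 in [21]). A real-valued differentiable function $f$ is $(r,\ell)$-smooth if 1) for any $\theta\in\mathbb{R}^m$, $\mathcal{B}(\theta,r(\|\nabla f(\theta)\|))\subseteq\mathbb{R}^m$, and 2) for any $\theta_1,\theta_2\in\mathcal{B}(\theta,r(\|\nabla f(\theta)\|))$, $\|\nabla f(\theta_1)-\nabla f(\theta_2)\|\le\ell(\|\nabla f(\theta)\|)\,\|\theta_1-\theta_2\|$, where for continuous functions $r,\ell:[0,+\infty)\to(0,+\infty)$, $r$ is non-increasing, $\ell$ is non-decreasing and $\mathcal{B}(\theta,R)$ is the Euclidean ball centered at $\theta$ with radius $R$.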
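As a concrete illustration of why the generalized condition is strictly weaker than $L$-smoothness, the following minimal sketch checks numerically that $f(x)=e^x$ satisfies $|f''(x)|\le\ell(|f'(x)|)$ with $\ell(u)=L_0+L_1u$, $L_0=0$, $L_1=1$, while its curvature grows without bound as the domain widens. The choice of function, the grids, and all helper names are illustrative assumptions and do not come from the paper.

```python
import math

# Illustrative example (an assumption, not from the paper): f(x) = exp(x).
def grad(x):
    return math.exp(x)      # f'(x)

def hess(x):
    return math.exp(x)      # f''(x)

def ell(u, L0=0.0, L1=1.0):
    """Candidate smoothness function ell(u) = L0 + L1 * u."""
    return L0 + L1 * u

def satisfies_ell_smoothness(points, tol=1e-12):
    """Check |f''(x)| <= ell(|f'(x)|) on a finite grid of points."""
    return all(abs(hess(x)) <= ell(abs(grad(x))) * (1.0 + tol) for x in points)

def max_curvature(points):
    """Largest |f''(x)| on the grid; no single constant L bounds it on all grids."""
    return max(abs(hess(x)) for x in points)

grid_small = [i * 0.5 for i in range(-10, 11)]    # x in [-5, 5]
grid_large = [i * 0.5 for i in range(-40, 41)]    # x in [-20, 20]

ok_small = satisfies_ell_smoothness(grid_small)
ok_large = satisfies_ell_smoothness(grid_large)
curv_small = max_curvature(grid_small)
curv_large = max_curvature(grid_large)
```

The inequality holds on both grids, whereas the largest observed curvature increases by several orders of magnitude when the grid is widened, so any fixed $L$ is eventually violated.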
2.2 Pareto concepts in multi-objective optimization (MOO)
As described before, MOO aims to find points at which there is no common descent direction for all objectives. Considering two points $\theta,\theta'\in\mathbb{R}^m$, we say that $\theta$ dominates $\theta'$ if $f_i(\theta)\le f_i(\theta')$ for all $i\in[K]$ and $F(\theta)\neq F(\theta')$. We say a point is Pareto optimal if it is not dominated by any other point. In other words, at a Pareto optimal point we cannot improve one objective without compromising another. In the general non-convex setting, MOO aims to find a Pareto stationary point defined as follows.
Definition 3.
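We say $\theta$ is a Pareto stationary point if $\min_{w\in\mathcal{W}}\|\nabla F(\theta)w\|=0$. In practice, we call $\theta$ an $\epsilon$-accurate Pareto stationary point if $\min_{w\in\mathcal{W}}\|\nabla F(\theta)w\|^2\le\epsilon^2$.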
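The dominance relation and Pareto optimality described in this subsection can be sketched in a few lines. The code below operates directly on objective vectors $F(\theta)$ for minimization; the function names and the candidate values are hypothetical and serve only as an illustration.

```python
def dominates(fa, fb):
    """Return True if objective vector fa Pareto-dominates fb (minimization)."""
    if len(fa) != len(fb):
        raise ValueError("objective vectors must have the same length")
    no_worse = all(a <= b for a, b in zip(fa, fb))   # f_i(a) <= f_i(b) for all i
    differs = any(a != b for a, b in zip(fa, fb))    # F(a) != F(b)
    return no_worse and differs

def pareto_front(points):
    """Keep the objective vectors that no other vector in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective values (f_1, f_2) of four candidate models.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)
```

Here `(3.0, 3.0)` is dominated by `(2.0, 2.0)`, while the remaining three candidates trade one objective against the other and are mutually non-dominated.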
2.3 Multiple-gradient descent algorithm (MGDA) and its stochastic variants
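Deterministic MGDA. One of the main challenges of MOO is gradient conflict, i.e., the gradients of different objectives may vary heavily in scale such that the largest gradient dominates the update direction. As a result, the performance of the objectives with smaller gradients may be significantly compromised [34]. To this end, we seek a balanced update direction for all objectives. Thus, we consider the minimum improvement across all objectives and maximize it by solving the following problem
$$\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\Big\{\frac{1}{\alpha}\big(f_i(\theta)-f_i(\theta-\alpha d)\big)\Big\}\approx\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\langle\nabla f_i(\theta),d\rangle, \tag{2}$$
where $\alpha$ is the step size, $d$ is the update direction, and the first-order Taylor approximation is applied at $\theta$. To efficiently solve the above problem in eq. 2, we substitute the following relation
$$\max_{d\in\mathbb{R}^m}\min_{i\in[K]}\langle\nabla f_i(\theta),d\rangle-\frac{1}{2}\|d\|^2=\max_{d\in\mathbb{R}^m}\min_{w\in\mathcal{W}}\Big\langle\sum_{i=1}^{K}w_i\nabla f_i(\theta),d\Big\rangle-\frac{1}{2}\|d\|^2, \tag{3}$$
where $\mathcal{W}$ is the probability simplex over $[K]$, and the regularization term $-\frac{1}{2}\|d\|^2$ is to regulate the magnitude of our update direction. The solution to the problem in eq. 3 can be obtained by solving the following problem [32]
$$d^*=\nabla F(\theta)w^*,\quad\text{s.t.}\quad w^*\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(\theta)w\|^2, \tag{4}$$
where $\nabla F(\theta)=[\nabla f_1(\theta),\ldots,\nabla f_K(\theta)]$ collects the gradients of all objectives.
The above approach has been widely used, e.g., in deterministic MGDA and its variants such as CAGrad and PCGrad [12, 34, 24].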
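For two objectives, the min-norm problem in eq. 4 reduces to a one-dimensional quadratic over $w\in[0,1]$ and admits a closed-form solution. The following minimal sketch implements that special case; it assumes $K=2$ and plain Python lists as gradients, and the function names are hypothetical rather than taken from any MGDA implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_two_tasks(g1, g2):
    """Solve min_{w in [0,1]} 0.5 * ||w*g1 + (1-w)*g2||^2 in closed form.

    Returns the weights (w, 1-w) and the direction d = w*g1 + (1-w)*g2.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        w = 0.5                                      # identical gradients: any weight is optimal
    else:
        w = dot([-x for x in diff], g2) / denom      # unconstrained minimizer
        w = min(1.0, max(0.0, w))                    # clip onto the simplex
    d = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
    return (w, 1.0 - w), d

# Two partially conflicting gradients (their inner product is negative).
g1, g2 = [1.0, 0.0], [-0.5, 1.0]
weights, d = min_norm_two_tasks(g1, g2)

# A case where the minimizer lies on the boundary of the simplex.
weights_clipped, d_clipped = min_norm_two_tasks([1.0, 0.0], [3.0, 0.0])
```

When the minimizer is in the interior of the simplex, the resulting direction has the same inner product $\|d\|^2$ with both gradients, so a step along $-d$ decreases both objectives to first order.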
Stochastic MGDA. SMG [25] is the first stochastic MGDA. It directly replaces the gradients with stochastic gradients and the update rule becomes
$$d_\xi=\nabla F(\theta;\xi)w^*_\xi,\quad\text{s.t.}\quad w^*_\xi\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(\theta;\xi)w\|^2,$$
where $\nabla F(\theta;\xi)$ is the estimate of $\nabla F(\theta)$ based on the sample $\xi$. However, this leads to a biased estimate of the update direction $d^*$, and thus it requires an increasing batch size. To solve this issue, MoCo [13] introduces a tracking variable as a stochastic estimate of the true gradient. Afterward, a double-sampling strategy was proposed by [6, 32] to generate a near-unbiased update direction.
2.4 Conflict-avoidant (CA) direction and CA distance
We call the update direction in eq. 4 the conflict-avoidant (CA) direction since it mitigates gradient conflict. Though it may not be feasible to calculate the exact CA direction, we aim to find an update direction that is close to it. Measuring the gap between the CA direction and the estimated update direction is therefore important, and we define this gap as the CA distance.
Definition 4.
$\|d-d^*\|$ is the CA distance between the estimated update direction $d$ and the CA direction $d^*$.
The larger the CA distance is, the further the estimated update direction is from the CA direction, and the more conflict there will be. In the single-loop algorithms MoCo and MoDo [13, 6], the average CA distance over iterations is of the order of $\epsilon$, while the double-loop algorithm SDMGrad [32] guarantees an $\epsilon$-order CA distance in every iteration. In our work, we analyze the CA distance in both cases and provide convergence results accordingly.
3 Single-loop Algorithms for MOO Under Generalized Smoothness
In this section, we present our main algorithms: Generalized Smooth Multi-objective Gradient descent (GSMGrad) and Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad), both of which are easy to implement with a simple single-loop structure. We also introduce an efficient variant of GSMGrad with constant-level computational and memory costs.
3.1 Generalized Smooth Multi-objective Gradient descent (GSMGrad)
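We start by adopting MGDA in our method, computing an approximated weight $w_t$ and an update direction $d_t=\nabla F(\theta_t)w_t$ according to eq. 4, where $t$ is the iteration number. However, since the optimal weight of the convex function $\frac{1}{2}\|\nabla F(\theta_t)w\|^2$ is not unique, we deal with this issue by adding an $\ell_2$ regularization term $\frac{\rho}{2}\|w\|^2$ and the problem becomes
$$w^*_{t,\rho}\in\arg\min_{w\in\mathcal{W}}\frac{1}{2}\|\nabla F(\theta_t)w\|^2+\frac{\rho}{2}\|w\|^2. \tag{5}$$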
Besides the benefit of a unique solution, adding an regularization term also makes Lipschitz continuous [13]. Note that may not be Lipschitz continuous because may not be positive definite. Nevertheless, the analysis of CA distance is difficult because may not be Lipschitz continuous. Thus, we will characterize the gap between and plus the change of after adding this regularization term. As a result, the update rules become Lines 4-5 in Algorithm 1. We first update by a projected gradient descent process and compute the update direction to update model parameters.
For our single-loop algorithm, CA distance is proportional to the term , which decreases as the algorithm iterates with some error terms controlled by appropriately chosen small step sizes. If we initialize randomly, will be a constant order, and so will the first CA distance. Meanwhile, we can only get an -order CA distance after a certain iteration number when takes an order. Thus, we add an extra warm start process using Algorithm 2 to guarantee the new is close enough to and a small level CA distance in every iteration. However, this warm start process is not needed if we only require a small averaged CA distance.
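The single-loop structure described above, one projected gradient step on the objective weights followed by one step on the model parameters, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the regularized weight objective $\frac{1}{2}\|\nabla F(\theta)w\|^2+\frac{\rho}{2}\|w\|^2$, the step sizes `alpha` and `beta`, the use of the freshly updated weights in the parameter step, the toy quadratic objectives, and every function name are assumptions made for illustration. The warm-start sub-procedure is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0.0:
            tau = candidate          # keep the threshold of the last valid index
    return [max(x - tau, 0.0) for x in v]

def task_grads(theta, centers):
    """Gradients of the toy objectives f_i(theta) = 0.5 * ||theta - c_i||^2."""
    return [[t - c for t, c in zip(theta, center)] for center in centers]

def combine(grads, w):
    """Multi-gradient sum_i w_i * grad_i."""
    return [sum(wi * g[j] for wi, g in zip(w, grads)) for j in range(len(grads[0]))]

def single_loop_sketch(theta, centers, alpha=0.1, beta=0.1, rho=0.01, steps=200):
    w = [1.0 / len(centers)] * len(centers)
    for _ in range(steps):
        grads = task_grads(theta, centers)
        d = combine(grads, w)
        # Gradient w.r.t. w of 0.5 * ||sum_i w_i g_i||^2 + 0.5 * rho * ||w||^2.
        grad_w = [dot(g, d) + rho * wi for g, wi in zip(grads, w)]
        w = project_simplex([wi - beta * gi for wi, gi in zip(w, grad_w)])
        d = combine(grads, w)        # direction with the updated weights (assumed)
        theta = [t - alpha * dj for t, dj in zip(theta, d)]
    return theta, w

# Two toy objectives whose Pareto set is the segment between the two centers.
theta_final, w_final = single_loop_sketch([0.5, 2.0], [[1.0, 0.0], [-1.0, 0.0]])
```

On this toy problem the iterate is pulled onto the segment between the two centers, where a convex combination of the two gradients vanishes, which is exactly the Pareto stationarity condition.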
3.2 Stochastic Generalized Smooth Multi-objective Gradient descent (SGSMGrad)
Under the stochastic setting, our algorithm keeps the same structure, having a warm start process and an update loop if we aim to control the CA distance in every iteration. In Algorithm 2, we do the same projected gradient descent without using stochastic gradients. This is because we only need to compute once and reuse it in the whole loop, which does not bring a computational burden. Then in the update loop, we update the weight and model parameters accordingly. We use a double-sampling strategy here to make the weight gradient estimator unbiased [32] such that is a near-unbiased multi-gradient where and are independent and unbiased estimates of . Similarly, we do not involve a warm start process if we require the average CA distance to be small.
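The benefit of double sampling can be checked exactly on a tiny example. The weight gradient involves a product of two gradient matrices; using the same sample in both factors adds the noise covariance to the expectation, whereas two independent samples give an unbiased product. The sketch below enumerates a two-point noise distribution; the gradients, the noise model, and the helper names are hypothetical.

```python
from itertools import product

TRUE_GRADS = [2.0, -1.0]          # hypothetical scalar gradients of two objectives
NOISE = [-1.0, 1.0]               # equally likely noise values with zero mean

def stochastic_grads(xi):
    """Unbiased estimate of the gradient row [g_1, g_2] under noise xi."""
    return [TRUE_GRADS[0] + xi, TRUE_GRADS[1] - xi]

def gram(ga, gb):
    """Matrix ga^T gb that appears in the gradient of the weight objective."""
    return [[a * b for b in gb] for a in ga]

def expectation(matrices):
    """Entry-wise average over equally likely outcomes."""
    n = len(matrices)
    return [[sum(m[i][j] for m in matrices) / n for j in range(2)] for i in range(2)]

exact = gram(TRUE_GRADS, TRUE_GRADS)
# Same sample in both factors: biased by the noise covariance.
single = expectation([gram(stochastic_grads(x), stochastic_grads(x)) for x in NOISE])
# Independent samples in the two factors: unbiased.
double = expectation([gram(stochastic_grads(x), stochastic_grads(y))
                      for x, y in product(NOISE, NOISE)])
```

In this example the single-sample average differs from the exact matrix by the noise covariance, while the double-sample average matches it exactly.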
3.3 Fast approximation via Taylor expansion
Similar to most MGDA-type algorithms, our methods require $\mathcal{O}(K)$ space and time to compute and store all task gradients at each iteration for updating the weight $w$. This becomes a drawback when the number of tasks or the model size is large. Motivated by [23], one solution is to use Taylor's theorem to approximate the gradient for updating the weight as
where is the remainder term and it takes the order , which can be made sufficiently small by adjusting the step size. Thus, we then propose GSMGrad-FA in Algorithm 4 (shown in Appendix B), where we update along the update direction to get following by the update rule of
(6) |
As a result, in the model parameters update process, we only require one backward process by calculating the gradient of w.r.t. without storing it, and additional forward processes to compute in the weight update process. This approach saves computational and memory costs in the practical implementation significantly. More importantly, we also provide a theoretical guarantee for this efficient method (in eq. 6).
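The idea behind the fast approximation can be illustrated numerically: by a first-order Taylor expansion, the inner product between each task gradient and the update direction is approximately the decrease in that task's objective value divided by the step size, so it can be estimated from forward passes alone. The sketch below is an illustration of this principle on toy quadratics, not a reproduction of eq. 6 or Algorithm 4; the objectives, the step size, and all names are assumptions.

```python
def objective(theta, center):
    """Toy objective f_i(theta) = 0.5 * ||theta - c_i||^2 (an assumption)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))

def gradient(theta, center):
    return [t - c for t, c in zip(theta, center)]

def exact_inner_products(theta, d, centers):
    """<grad f_i(theta), d> for every task, using explicit task gradients."""
    return [sum(g * dj for g, dj in zip(gradient(theta, c), d)) for c in centers]

def forward_only_inner_products(theta, d, centers, alpha):
    """Estimate <grad f_i(theta), d> as (f_i(theta) - f_i(theta - alpha*d)) / alpha."""
    theta_next = [t - alpha * dj for t, dj in zip(theta, d)]
    return [(objective(theta, c) - objective(theta_next, c)) / alpha for c in centers]

theta = [0.5, 2.0]
centers = [[1.0, 0.0], [-1.0, 0.0]]
d = [0.5, 2.0]                    # equal-weight combination of the two gradients
alpha = 1e-3

exact = exact_inner_products(theta, d, centers)
approx = forward_only_inner_products(theta, d, centers, alpha)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

For these quadratics the approximation error equals $\frac{\alpha}{2}\|d\|^2$, which shrinks linearly with the step size, in line with the remainder term discussed above.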
4 Convergence Analysis under Average CA distance
In this section, we provide the theoretical results for Algorithms 1 and 3 without warm starts to obtain an -accurate Pareto stationary point, with the average CA distance over iterations in .
4.1 Deterministic setting
Assumption 1.
Each objective function is twice differentiable and lower bounded by .
Assumption 2.
Each objective function is -smooth defined in Definition 1, where is a continuous non-decreasing function such that is monotonically increasing for any .
These assumptions are the most relaxed ones among existing MOO works, since existing works directly assume objective smoothness or gradient/function value boundedness [24, 13, 27, 32, 6, 33, 3]. They also include the widely studied standard $L$-smoothness [28, 15, 16] and $(L_0,L_1)$-smoothness [35] as special cases. Moreover, for any and , our assumption even holds for function such that , where are limited to in [9].
We then provide our theoretical results. Let and be some constants such that where . Define . We then have the following convergence rate for Algorithm 1:
The full version with detailed constants and the detailed proof can be found in Appendix C.1. Theorem 1 provides the first convergence rate to obtain an $\epsilon$-accurate Pareto stationary point for MOO problems with $\ell$-smooth objectives. Moreover, it achieves the optimal sample complexity in the order of $\mathcal{O}(\epsilon^{-2})$ for GD with a single standard $L$-smooth objective [5]. MOO problems with $\ell$-smooth objectives are challenging for two reasons: 1) the gradient norm is potentially unbounded in our $\ell$-smoothness setting, making all existing analyses in MOO [24, 13, 27, 32, 6, 33, 3] inapplicable; 2) the update of includes all gradient information from each task, making the existing adaptive methods for single generalized smooth functions invalid.
To solve the challenges in Theorem 1, we find that a bounded function value implies a bounded gradient norm. Thus in our proof, we use induction to show that with parameters selected in Theorem 1, for any and , we have that is upper bounded by . Consequently, for any , we have that , which solves the unbounded gradient norm problem in our generalized smoothness setting. Then we can show that converges.
Corollary 1.
Under the same setting in Theorem 1, .
4.2 Stochastic setting
In the stochastic setting, we assume that we have access to an unbiased stochastic gradient instead of the true gradient , where is the collected samples. To prove convergence, we have the following assumption.
Assumption 3.
There exists some such that for any .
Let be the -th collection of samples at time in the stochastic model and In this section, we choose for any . Define .
Let and be some constants such that and . Define the following random variables and , where are some constants and denotes . We have the following theorem:
Theorem 2.
The full version and detailed proof can be found in Appendix C.3. When we set and , we can find an $\epsilon$-stationary point with the optimal sample complexity in the order of $\mathcal{O}(\epsilon^{-4})$ for SGD with a single $L$-smooth objective [1]. Note that in the proof of Theorem 1, we show for each and , we have that is bounded by applying a small constant step size and . However, this condition does not necessarily hold in our stochastic setting due to the unbounded gradient noise. To solve this problem, we introduce a stopping time . The advantages are as follows: 1) for any , , we have that is bounded; 2) for any , the norm of the gradient noise is bounded; 3) due to the optional stopping theorem, for any and , we have that . Based on these properties, we can further get the following lemma:
Lemma 1.
Using the parameters selected in Theorem 2, we have that
(7) |
The proof of Lemma 1 is available in C.4. Lemma 1 indicates that is bounded by some constant and if with high probability, we have that . Note that The first two events are related to the gradient noise, where the probabilities can be bounded by Assumption 3 and Chebyshev’s inequality. The last event indicates that for some , we have Based on Lemma 1 and Markov inequality, we can show that and we can further show that . We then have that converges with high probability. Similar to Corollary 1, Theorem 2 also implies the average of CA distances converges with time with high probability.
5 Convergence Analysis under Iteration-wise CA distance
In Section 4, we show that the average CA distance is bounded under generalized smoothness conditions. The average CA distance is also studied in MoCo [13] and MoDo [6] under the bounded-gradient assumption, and these works only provide guarantees on the average CA distance over iterations. However, an $\epsilon$-level average CA distance only implies that the smallest CA distance is $\epsilon$-level. Since we want to keep the update direction close enough to the CA direction, it is better to have a tighter bound on the CA distances. In this section, we show that the CA distance is $\epsilon$-level at every iteration with the help of a warm-start process, and we provide convergence results for Algorithms 1, 3, and 4.
5.1 Deterministic setting
Deterministic setting without fast approximation. We first provide results about bounded iteration-wise CA distance for Algorithm 1 with a warm start.
Theorem 3.
The finite-time error bound and the full proof can be found in Appendix D.1. Since our parameters satisfy the requirements in the formal version of Theorem 1, we can find an $\epsilon$-accurate Pareto stationary point with samples. In the analysis of CA distance, we show that the CA distance can be bounded by the term plus the strongly-convex constant . Meanwhile, there is a decay relation between and with some error terms controlled by step sizes. Nevertheless, the error terms accumulate when we telescope this decay relation, and they become the dominating term. Thus, step sizes have to be much smaller than the choices in Theorem 1 to guarantee a small iteration-wise CA distance.
Deterministic setting with fast approximation. In this section, we show the convergence rate of Algorithm 4 and bounded iteration-wise CA distance.
The full version with detailed constants and proof can be found in Section D.2. The analysis in Section C.1 extends easily to the convergence analysis of Algorithm 4, because the only extra effort is dealing with the remainder term, which can be simply bounded by the smallest step size. As a result, the sample complexity to achieve a Pareto stationary point remains the same.
5.2 Stochastic setting
In this section, we show that Algorithm 3 with a warm start and mini-batches achieves a bounded iteration-wise CA distance with high probability. In this section, we choose where is the size of the mini-batch.
Theorem 6.
The full version with detailed constants and proof can be found in Appendix D.4. Since our parameters satisfy all requirements in Theorem 2, we can find an $\epsilon$-accurate Pareto stationary point with high probability. Compared with Theorem 2, to guarantee an iteration-wise CA distance, a mini-batch method is required in our analysis in addition to the warm-start process. This is because given , the gradient is not unbiased. In Theorem 2, the optional stopping theorem is applied, which indicates that the expectation of the cumulative gradient is zero. However, for each iteration, this optional stopping theorem does not hold and the estimation error is controlled by the size of the mini-batch. Then, the sample complexity to get a Pareto stationary point becomes due to necessary mini-batch .
6 Experiments
In this experiment, we evaluate our methods on the Cityscapes dataset [10], which involves 2 pixel-wise tasks: 7-class semantic segmentation (Task 1) and depth estimation (Task 2). Following the experimental setup of [32], we build a SegNet [2] as the model and compare the performance of MGDA [12], PCGrad [34], GradDrop [8], CAGrad [24], MoCo [13], MoDo [6], Nash-MTL [27], and SDMGrad [32] with our methods, GSMGrad and GSMGrad-FA, with the warm-start initialization. We use the $\Delta m\%$ metric to reflect the overall performance, which measures the average per-task performance drop relative to the single-task (STL) baseline. It can be observed in Figure 2 that GSMGrad has a better result in task 2 and a much more balanced performance. Meanwhile, the proposed GSMGrad-FA is much faster than GSMGrad as shown in Table 2 in the Appendix.
In addition, we illustrate the relationship between the gradient norm and the local smoothness for each task. To do so, we compute them according to the method described in Section H.3 of [35]. We plot the local smoothness constant against the gradient norm for the semantic segmentation task in Figure 2 and for the depth estimation task in Figure 3 (in the appendix). Both results demonstrate a positive correlation between the two quantities, which further substantiates the necessity of our analysis. More experimental details can be found in Appendix A.
Method | Segmentation mIoU | Segmentation Pix Acc | Depth Abs Err | Depth Rel Err | $\Delta m\%$ |
STL | 74.01 | 93.16 | 0.0125 | 27.77 | |
MGDA [12] | 68.84 | 91.54 | 0.0309 | 33.50 | 44.14 |
PCGrad [34] | 75.13 | 93.48 | 0.0154 | 42.07 | 18.29 |
GradDrop [8] | 75.27 | 93.53 | 0.0157 | 47.54 | 23.73 |
CAGrad [24] | 75.16 | 93.48 | 0.0141 | 37.60 | 11.64 |
MoCo [13] | 75.42 | 93.55 | 0.0149 | 34.19 | 9.90 |
MoDo [6] | 74.55 | 93.32 | 0.0159 | 41.51 | 18.89 |
Nash-MTL [27] | 75.41 | 93.66 | 0.0129 | 35.02 | 6.82 |
SDMGrad [32] | 74.53 | 93.52 | 0.0137 | 34.01 | 7.79 |
GSMGrad | 75.41 | 93.46 | 0.0133 | 31.07 | 3.93 |
GSMGrad-FA | 74.38 | 93.24 | 0.0160 | 41.78 | 19.44 |
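The overall metric, the average per-task performance drop relative to the STL baseline, can be recomputed from the per-task columns. The sketch below assumes the commonly used form of this metric, in which each relative change is signed so that a positive value means worse than STL and the result is averaged over the four metrics and expressed in percent; the dictionary keys and function name are hypothetical. Because the table entries are rounded, recomputed values may differ slightly from the reported column.

```python
# Per-task numbers copied from the results table (STL is the single-task baseline).
STL = {"mIoU": 74.01, "PixAcc": 93.16, "AbsErr": 0.0125, "RelErr": 27.77}
HIGHER_IS_BETTER = {"mIoU": True, "PixAcc": True, "AbsErr": False, "RelErr": False}

def delta_m(method, baseline=STL):
    """Average relative performance drop versus the baseline, in percent."""
    total = 0.0
    for name, base in baseline.items():
        # A drop is a decrease for accuracy-type metrics and an increase for errors.
        sign = -1.0 if HIGHER_IS_BETTER[name] else 1.0
        total += sign * (method[name] - base) / base
    return 100.0 * total / len(baseline)

mgda = {"mIoU": 68.84, "PixAcc": 91.54, "AbsErr": 0.0309, "RelErr": 33.50}
gsmgrad = {"mIoU": 75.41, "PixAcc": 93.46, "AbsErr": 0.0133, "RelErr": 31.07}

mgda_drop = delta_m(mgda)
gsmgrad_drop = delta_m(gsmgrad)
```

For the MGDA row this reproduces the reported 44.14 up to rounding, which supports reading the last column as this averaged relative drop.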
[Figure 2: scatter plot of the local smoothness constant against the gradient norm for the semantic segmentation task.]
7 Conclusion
In this paper, we investigate the multi-objective optimization problem under a more challenging, relaxed and realistic $\ell$-smooth assumption. We propose the first efficient MOO algorithm, GSMGrad, and its stochastic variant SGSMGrad for this problem. We provide convergence guarantees for both algorithms to find an $\epsilon$-accurate Pareto stationary point with an $\epsilon$-level average or iteration-wise CA distance. Extensive experiments are conducted to validate our theoretical results.
References
- [1] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
- [2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
- [3] Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. arXiv preprint arXiv:2402.15638, 2024.
- [4] Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery. Convex optimization in signal processing and communications, pages 42–88, 2009.
- [5] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020.
- [6] Lisha Chen, Heshan Fernando, Yiming Ying, and Tianyi Chen. Three-way trade-off in multi-objective learning: Optimization, generalization and conflict-avoidance. Advances in Neural Information Processing Systems, 36, 2024.
- [7] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
- [8] Zhao Chen, Jiquan Ngiam, Yanping Huang, Thang Luong, Henrik Kretzschmar, Yuning Chai, and Dragomir Anguelov. Just pick a sign: Optimizing deep multitask models with gradient sign dropout. Advances in Neural Information Processing Systems, 33:2039–2050, 2020.
- [9] Ziyi Chen, Yi Zhou, Yingbin Liang, and Zhaosong Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. arXiv preprint arXiv:2303.02854, 2023.
- [10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- [11] Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, and Zhenxun Zhuang. Robustness to unbounded smoothness of generalized signsgd. Advances in Neural Information Processing Systems, 35:9955–9968, 2022.
- [12] Jean-Antoine Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
- [13] Heshan Devaka Fernando, Han Shen, Miao Liu, Subhajit Chaudhury, Keerthiram Murugesan, and Tianyi Chen. Mitigating gradient bias in multi-objective learning: A provably convergent approach. In The Eleventh International Conference on Learning Representations, 2022.
- [14] Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
- [15] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [16] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
- [17] Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. Dynamic task prioritization for multitask learning. In Proceedings of the European conference on computer vision (ECCV), pages 270–287, 2018.
- [18] Xinyu Huang, Peng Wang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The apolloscape open dataset for autonomous driving and its application. IEEE transactions on pattern analysis and machine intelligence, 42(10):2702–2719, 2019.
- [19] Jikai Jin, Bohang Zhang, Haiyang Wang, and Liwei Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34:2771–2782, 2021.
- [20] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
- [21] Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. Advances in Neural Information Processing Systems, 36, 2024.
- [22] Haochuan Li, Alexander Rakhlin, and Ali Jadbabaie. Convergence of adam under relaxed assumptions. Advances in Neural Information Processing Systems, 36, 2024.
- [23] Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. Famo: Fast adaptive multitask optimization. Advances in Neural Information Processing Systems, 36, 2024.
- [24] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021.
- [25] Suyun Liu and Luis Nunes Vicente. The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. Annals of Operations Research, pages 1–30, 2021.
- [26] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1930–1939, 2018.
- [27] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022.
- [28] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
- [29] Amirhossein Reisizadeh, Haochuan Li, Subhro Das, and Ali Jadbabaie. Variance-reduced clipping for non-convex optimization. arXiv preprint arXiv:2303.00883, 2023.
- [30] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, 31, 2018.
- [31] Philip S Thomas, Joelle Pineau, Romain Laroche, et al. Multi-objective spibb: Seldonian offline policy improvement with safety constraints in finite mdps. Advances in Neural Information Processing Systems, 34:2004–2017, 2021.
- [32] Peiyao Xiao, Hao Ban, and Kaiyi Ji. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms. Advances in Neural Information Processing Systems, 36, 2024.
- [33] Haibo Yang, Zhuqing Liu, Jia Liu, Chaosheng Dong, and Michinari Momma. Federated multi-objective learning. Advances in Neural Information Processing Systems, 36, 2024.
- [34] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
- [35] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
- [36] Shiji Zhou, Wenpeng Zhang, Jiyan Jiang, Wenliang Zhong, Jinjie Gu, and Wenwu Zhu. On the convergence of stochastic multi-objective gradient manipulation and beyond. Advances in Neural Information Processing Systems, 35:38103–38115, 2022.
Appendix A Experimental details
A.1 Relation between gradient norms and the local smoothness
This part shows the relation between local smoothness and gradient norm for each task. Both scatter plots show a positive correlation between the two quantities, which supports the need for an analysis under generalized smoothness.
![Refer to caption](extracted/5703021/scatter_plot_GS_task1.png)
![Refer to caption](extracted/5703021/scatter_plot_GS_task2.png)
A.2 Running time comparison between GSMGrad and GSMGrad-FA
We compare the average running time of the proposed algorithms, GSMGrad and GSMGrad-FA. The time reported in Table 2 is the total running time averaged over epochs (in minutes). The result confirms the speed advantage of the fast approximation.
Table 2: Average running time (in minutes).

| Method | Average running time (minutes) |
| --- | --- |
| GSMGrad | 2.93 |
| GSMGrad-FA | 1.93 |
A.3 Implementation details
Multi-task learning on Cityscapes dataset. Following the experiment setup in [32], we train our method for 200 epochs, using SGD optimizers for both model parameters and weights, and the batch size for Cityscapes is 8. We compute the averaged test performance over the last 10 epochs as the final performance measure. We fix the and do a grid search on hyperparameters including , and and choose the best result from them. It turns out our best performance of GSMGrad is based on the choice that , and . The choice of hyperparameters for GSMGrad-FA turns out to be the same as that for GSMGrad. All experiments are run on NVIDIA RTX A6000.
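$\Delta m\%$ reflects the average per-task performance drop of method $m$ relative to the single-task learning (STL) baseline $b$. We calculate it by the following equation:
$$\Delta m\% = \frac{1}{K}\sum_{k=1}^{K} (-1)^{\delta_k}\,\frac{M_{m,k} - M_{b,k}}{M_{b,k}} \times 100,$$
where $K$ is the number of metrics, $M_{b,k}$ is the value of metric $M_k$ obtained by baseline $b$, and $M_{m,k}$ is the value obtained by the compared method $m$. $\delta_k = 1$ if the evaluation metric $M_k$ on task $k$ prefers a higher value and $0$ otherwise.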
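As a rough illustration, the performance-drop metric can be computed as in the sketch below. It assumes the commonly used form in which each metric's relative change against the baseline is sign-adjusted so that a positive value means a drop, then averaged over metrics and scaled by 100. The function name `delta_m_percent` and the numbers in the example are hypothetical and are not taken from the experiments.

```python
def delta_m_percent(method_vals, baseline_vals, higher_is_better):
    """Average relative performance drop (in percent) versus a baseline.

    method_vals[k]      : value of metric k for the compared method
    baseline_vals[k]    : value of metric k for the baseline (e.g. STL)
    higher_is_better[k] : True if metric k prefers a higher value
    """
    if not (len(method_vals) == len(baseline_vals) == len(higher_is_better)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for m, b, hib in zip(method_vals, baseline_vals, higher_is_better):
        # The sign flips for higher-is-better metrics, so that a worse
        # method always contributes a positive (drop) term.
        sign = -1.0 if hib else 1.0
        total += sign * (m - b) / b
    return 100.0 * total / len(method_vals)


# Hypothetical example: one accuracy-like metric and one error-like metric.
example = delta_m_percent([50.0, 2.0], [100.0, 1.0], [True, False])
```

A negative result means the compared method improves on the baseline on average.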
Generalized smoothness illustration. To illustrate the relation between gradient norms and local smoothness, we run SGD on each task separately without the warm start process. Since there is no weight update process, we only need to choose for both tasks.
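One common way to estimate local smoothness along a training trajectory is the finite-difference ratio between the change in gradients and the change in parameters over consecutive iterates. The sketch below applies this estimator to a toy objective. The estimator, the objective $f(x) = \frac{1}{4}\sum_i x_i^4$, the deterministic gradient steps, and all names are assumptions for illustration; the text above does not specify how the plotted smoothness values were obtained.

```python
import math


def grad(x):
    # Gradient of the toy objective f(x) = 0.25 * sum(x_i ** 4).
    return [xi ** 3 for xi in x]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def track_local_smoothness(x0, lr=0.01, steps=200):
    """Run gradient descent and record (gradient norm, smoothness estimate)."""
    x, g = list(x0), grad(x0)
    records = []
    for _ in range(steps):
        x_next = [xi - lr * gi for xi, gi in zip(x, g)]
        g_next = grad(x_next)
        # Finite-difference estimate: ||g(x') - g(x)|| / ||x' - x||.
        diff_g = norm([a - b for a, b in zip(g_next, g)])
        diff_x = norm([a - b for a, b in zip(x_next, x)])
        records.append((norm(g), diff_g / diff_x))
        x, g = x_next, g_next
    return records


def pearson(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


records = track_local_smoothness([2.0, -1.5])
correlation = pearson(records)
```

On this toy objective the smoothness estimate shrinks together with the gradient norm, so the correlation is positive. This is the qualitative behavior that generalized smoothness is meant to capture.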
Appendix B Algorithm
We show our GSMGrad with Fast Approximation (GSMGrad-FA):
Appendix C Detailed Proofs for Average CA Distance
C.1 Formal version and proof of Theorem 1
Let , and be some constants such that
(8) |
Define . We then have the following convergence rate for Algorithm 1 without warm start:
Theorem 7.
Proof.
Compared with the standard -smoothness, the generalized smoothness is more challenging to address due to the unbounded Lipschitz constant. Lemma 2 demonstrates that a bounded function value implies a bounded gradient norm, which further implies a bounded Lipschitz constant. In the following, we solve the unbounded Lipschitz constant problem by showing that the function value is bounded with the parameters selected in Theorem 1. We prove that for any and we have that by induction.
Base Case: since all are non-negative, according to (8) we have that holds for any .
Induction step: assume that for any and , we have that holds. We then prove that holds for any .
For , based on the monotonicity shown in Lemma 2, we have that . From Assumption 2, we have that is -smooth by setting . For any , we have that
where the second inequality is due to and the last inequality is due to . Based on Assumption 2, Definition 2 and Lemma 3.3 in [21], we have the following descent lemma:
As a result, for any , we have that
(10) |
Based on the update process of , we have that
It then follows that
where the inequality is due to the non-expansiveness of projection. By rearranging the above inequality, we have that
(11) |
Plugging (11) into (10), we can show that
(12) |
Summing (12) from to , for any we have that
(13) |
where the first inequality is due to and the last inequality holds because and . Thus for any it can be shown that
since we have that and . This completes the induction step and shows that and (C.1) hold for all and .
Lemma 2.
(Lemma 3.5 in [21]) If a function is -smooth, we have that
(14) |
C.2 Proof of Corollary 1
Proof.
Recall that
where the first inequality follows from the optimality condition. Then we have
where we follow the same setting in Theorem 1. The proof is complete. ∎
C.3 Formal Version and Its Proof of Theorem 2
Let be some constants. Let and be some constants such that
where . Let be some constants.
For
,
,
and such that
(15) |
Define the following random variables
We then have the following theorem:
Theorem 8.
Proof.
Small probability of the event .
We first show that the probability of the event is small: . Note that
For any , we have that
where the last inequality is due to Chebyshev’s inequality. Based on the union bound, we have that
(16) |
since . Similarly, we have that
It follows that
(17) |
We then bound the probability of the event . Since we have that for some , .
According to (C.4) shown in Lemma 1, for any and we have that
where the first inequality holds because , and the second one is due to and .
However, for some , . Thus for this task, we have that
According to Lemma 1, we have that
By Markov’s inequality, it follows that
(18) |
which indicates that . It follows that .
Convergence of . Based on (C.4) in Lemma 1, we have that
where the second inequality is due to and the last inequality is due to our selection of parameters. As a result, we have that
(19) |
where the bound on the first probability follows from Markov’s inequality. Thus we have that
where the last inequality is due to (16), (17), (18), and (19). This completes the proof. ∎
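The proof above relies on three standard probability facts. For reference, they can be stated as follows, where $X$, $a$, and the events $A_1, \dots, A_n$ are generic placeholders rather than the notation of the theorem:

```latex
% Markov's inequality: X is a nonnegative random variable and a > 0.
\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}.

% Chebyshev's inequality: X has finite variance and a > 0.
\Pr\big(|X - \mathbb{E}[X]| \ge a\big) \le \frac{\operatorname{Var}(X)}{a^{2}}.

% Union bound: A_1, ..., A_n are arbitrary events.
\Pr\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} \Pr(A_i).
```

Chebyshev's inequality follows from Markov's inequality applied to $(X - \mathbb{E}[X])^2$ with threshold $a^2$.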
C.4 Proof of Lemma 1
Proof.
For all , we have which further implies that . Moreover, we have that for any and ,
Since is -smooth, it follows that
As a result, for any , we have that
(20) |
Based on the update process of , we have that
where the inequality follows from the non-expansiveness of projection. It follows that
(21) |
Combining (20) and (21), we get that
(22) |
Taking expectations and summing (22) from to , we have that
(23) |
where the last inequality holds because and for any , . By the optional stopping theorem, we have that
which further implies that
(24) |
Similarly, we have , and .
Appendix D Detailed Proofs for Iteration-wise CA Distance
We first provide some useful lemmas, which will be used in our main theorems.
Lemma 3 (Continuity of ).
Proof.
We first define as the -th iterate of a function using projected gradient descent (PGD) with a constant step size . The update rule is . By the non-expansiveness of projection, we have
Since we set and , telescoping the above inequality from gives
(26) |
Then, by the Cauchy-Schwarz inequality, it follows that
(27) |
where follows from eq. 26 and follows from the convergence of PGD (Theorem 1.1, [4]) on -strongly convex objectives that
Then lemma 3 can be bounded by
where the last inequality follows from and is -smooth by setting . The proof is complete. ∎
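The PGD iteration used in this proof can be sketched as follows. The sketch assumes that the feasible set is the probability simplex and that the strongly convex objective has the regularized quadratic form $h(w) = \frac{1}{2} w^\top Q w + \frac{\rho}{2}\|w\|^2$, where $Q$ is a Gram matrix of task gradients. Both are assumptions made for illustration, as are the names `project_simplex` and `pgd_weights` and the numerical values.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        # The condition holds for a prefix of the sorted entries; the
        # threshold from the last index where it holds is the shift.
        if uj - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]


def pgd_weights(gram, rho, beta, num_steps):
    """Projected gradient descent on h(w) = 0.5 * w'Qw + 0.5 * rho * ||w||^2."""
    k = len(gram)
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(num_steps):
        # Gradient of h at w: Q w + rho * w.
        g = [sum(gram[i][j] * w[j] for j in range(k)) + rho * w[i]
             for i in range(k)]
        w = project_simplex([wi - beta * gi for wi, gi in zip(w, g)])
    return w


# Hypothetical two-task example with orthogonal gradients of norms 1 and 2.
gram = [[1.0, 0.0], [0.0, 4.0]]
weights = pgd_weights(gram, rho=0.1, beta=0.2, num_steps=100)
```

With a constant step size no larger than the inverse of the largest eigenvalue of $Q + \rho I$, the iterates contract toward the unique minimizer at a linear rate, which is the property the proof invokes. In this example the minimizer puts more weight on the task with the smaller gradient.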
Lemma 4.
Given and , we have
Proof.
Recall that , then we have
By rearranging the above inequality, we have
Then recall that , we have
where the first inequality follows from the optimality condition that
The proof is complete. ∎
Lemma 5.
Proof.
By Taylor’s theorem, we have the following result for any objective function .
where is the remainder term. Then according to the descent lemma of each objective function , we have
Then we can obtain
Thus, by the Cauchy-Schwarz inequality, we have
The proof is complete. ∎
D.1 Proof of Theorem 3
Theorem 9.
Proof.
Since our parameters satisfy all requirements in Theorem 1, we have that . According to the definition of CA distance, we have
(28) |
where follows from the Cauchy-Schwarz inequality and follows from for any and Lemma 4. Then for the first term on the right-hand side (RHS) of the above inequality, we have
(29) |
For the first term on the RHS in the above inequality, we have
(30) |
where follows from the non-expansiveness of projection and follows from properties of strong convexity and the Cauchy-Schwarz inequality. Then for the second term on the RHS in eq. 29, we have
(31) |
Then for the last term on the RHS in eq. 29, we have
(32) |
where follows from the update rule in Algorithm 1, Lemma 3, and Young’s inequality. Then substituting eq. 30, eq. 31 and eq. 32 into eq. 29, we have
where the last inequality follows from Lemma 5 and . Then we telescope over
Then recalling that and substituting the above inequality into eq. 28, we have
Since we run projected gradient descent for the strongly convex function in the N-loop in Algorithm 1, according to Theorem 10.5 [14], we have by choosing
Thus, as . The CA distance is of order in every iteration by choosing , , and . The proof is complete. ∎
D.2 Formal Version and Its Proof of Theorem 4
Let , , and be some constants such that
We then have the following convergence rate for Algorithm 4.
Theorem 10.
Proof.
Following similar steps in Section C.1, we also prove that for any and , we have that by induction.
Base case: since all constants are non-negative, we have that holds for any .
Induction step: assume that for any and , holds. We then prove holds for any . Following similar steps in Section C.1, we have
(33) |
Based on the update rule of and non-expansiveness of projection, we have
where follows from the Cauchy-Schwarz inequality and , and follows from Lemma 5. Then we have
Substituting the above inequality into eq. 33, we obtain
Then taking sums of the above inequality from to , for any , we have
(34) |
where the last inequality follows from . Thus, for any , it can be shown that
since we have that , , , . Now we finish the induction step and can show that and section D.2 hold for all and . Specifically, for , we have
Then following the choice of step sizes, we can obtain
The proof is complete. ∎
D.3 Formal Version and Its Proof of Theorem 5
Theorem 11.
Proof.
According to the definition of CA distance, we have
(35) |
where follows from the Cauchy-Schwarz inequality and follows from for any and Lemma 3. Then for the first term on the right-hand side (RHS) of the above inequality, we have
(36) |
For the first term on the RHS in the above inequality, we have
(37) |
where follows from the non-expansiveness of projection and follows from properties of strong convexity and the Cauchy-Schwarz inequality. Then for the second term on the RHS in eq. 36, we have
(38) |
Then for the last term on the RHS in eq. 36, we have
(39) |
Then substituting eq. 37, eq. 38 and eq. 39 into eq. 36, we have
where the last inequality follows from Lemma 5 and . Then we telescope over
Then substituting the above inequality into eq. 35, we have
Since we run projected gradient descent for the strongly convex function in the N-loop in Algorithm 4, according to Theorem 10.5 [14], we have
Thus, as . The CA distance is of order in every iteration by choosing , , and . The proof is complete. ∎
D.4 Formal Version and Its Proof of Theorem 6
Let satisfy all requirements for Theorem 2 with . Moreover, for and and , we have the following theorem:
Theorem 12.
Proof.
When and , according to the definition of CA distance, we have
(40) |
where follows from the Cauchy-Schwarz inequality and follows from for any and Lemma 4. We then show that for any , we have that by induction.
Base case: Since we run projected gradient descent for the strongly convex function in the N-loop in Algorithm 1, according to Theorem 10.5 [14], we have by choosing
Thus, as .
Induction: Assume we have that , we will show that holds for any in the following proof. We first divide into three parts:
(41) |
For the first term on the RHS in the above inequality, we have that
(42) |
where follows from the non-expansiveness of projection and follows from properties of strong convexity and the Cauchy-Schwarz inequality. Taking the conditional expectation of (D.4), we have that for any ,
(43) |
where the last inequality holds because for , and for any , we have that . Then for the second term on the RHS in eq. 41, we have
(44) |
where the first inequality is due to Lemma 3, where . Then for the last term on the RHS in eq. 41, for any , we have that
(45) |
where follows from the non-expansiveness of projection and (D.4), and the last inequality is from (D.4). Then substituting section D.4, section D.4 and section D.4 into eq. 41, we have
(46) |
where the last inequality holds because for any ,
and
According to (D.4), with and , we have that
We then complete our induction and prove that for any , we have that .
As a result, we have that
where the bound on the first probability follows from Markov’s inequality. Thus we have that
(47) |
where the last inequality is because our parameters satisfy all the requirements in Theorem 2, thus . Then based on (D.4), by setting , we have that with probability at least for each iteration , which completes the proof. ∎