Towards Principled Task Grouping for Multi-Task Learning

Chenguang Wang Xuanhao Pan Tianshu Yu

Abstract

This paper presents a novel approach to task grouping in Multitask Learning (MTL), advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our approach offers a more theoretically grounded method that does not rely on restrictive assumptions for constructing transfer gains. We also propose a flexible mathematical programming formulation which can accommodate a wide spectrum of resource constraints, thus enhancing its versatility. Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks and time series tasks, demonstrate the superiority of our method over extensive baselines, validating its effectiveness and general applicability in MTL.

Machine Learning, ICML

1 Introduction

Multitask Learning (MTL) (Caruana, 1997; Zhang & Yang, 2021; Vandenhende et al., 2021) represents a forefront area in machine learning, aiming to improve learning efficiency and prediction accuracy by leveraging commonalities and differences across multiple tasks, reflected by the so-called inter-task “transfer gain”. Building upon this foundational concept, MTL has exhibited exceptional performance across a spectrum of domains, including computer vision (Standley et al., 2020; Fifty et al., 2021; Song et al., 2022; Sherif et al., 2023), NLP (Zhang et al., 2022b; Ding et al., 2023), Neural Architecture Search (Guo et al., 2020; Zhang et al., 2022a; Raychaudhuri et al., 2022; Yue et al., 2023), speech recognition (Zhang et al., 2019b; Huang et al., 2022) and combinatorial optimization problems (Wang & Yu, 2023; Wang et al., 2024). Central to the optimization of this framework is the concept of task grouping. Task grouping (Kang et al., 2011; Kumar & Daume III, 2012; Lee et al., 2016, 2018; Zamir et al., 2018; Dwivedi & Roig, 2019; Malhotra et al., 2022; Standley et al., 2020; Fifty et al., 2021; Song et al., 2022) in MTL involves strategically dividing a set of tasks into several groups, where each group encapsulates tasks that share maximal positive transfer while minimizing negative transfer.

Recent studies such as Standley et al. (2020) and Fifty et al. (2021) have contributed significantly to this domain. Both works collect specific task affinities in a single run of MTL training and then use them to group tasks, based on the assumption of high-order approximation on the task relationships. Subsequently, these groups are trained separately using MTL methods. However, these approaches exhibit key limitations. Firstly, their task affinity measures lack a theoretical guarantee, raising concerns about the reliability and predictability of the resulting task grouping. Secondly, they rely on an enumeration-based branch and bound algorithm for solving the task grouping problem. This approach not only sacrifices efficiency in terms of computational resources but also makes it difficult to incorporate additional constraints, limiting its practical applicability in more complex and realistic scenarios.

In this work, we introduce a novel approach to task grouping in MTL that addresses these limitations and offers significant advancements over existing methodologies. First, we propose a theoretically grounded method for constructing transfer gains. Different from TAG (Fifty et al., 2021), which makes restrictive assumptions such as convexity and smoothness of the loss functions, the proposed transfer gain is derived without any such conditions. Furthermore, it maintains computational complexity at the same order as TAG by following the high-order approximation assumption regarding task relationships, as utilized in prior work (Standley et al., 2020; Fifty et al., 2021). Second, our work introduces a generic and flexible mathematical programming formulation to solve task grouping problems. This formulation can readily incorporate a variety of budget constraints, a critical aspect in real-world applications. By doing so, our method ensures the practicality and adaptability of MTL models in diverse scenarios, ranging from computational budget allocation to resource utilization considerations.

Our experimental evaluations across diverse domains, including computer vision datasets like CelebA, combinatorial optimization benchmarks and time series datasets, demonstrate the validity and generality of our proposed task grouping strategy in three respects. First, we establish that our method consistently outperforms a wide range of baselines, encompassing single-task learning, multi-task learning, and various task grouping methods. This substantiates its effectiveness across these three diverse domains. Second, we illustrate the flexibility and effectiveness of our proposed mathematical programming formulation by introducing various constraints, mirroring real-world scenarios where resource budgets, such as GPU memory limitations and resource utilization, come into play. Our results demonstrate that our method significantly outperforms the baseline methods, showcasing its adaptability and performance improvement under such constraints. Finally, we provide a comprehensive ablation study, highlighting the superiority of our proposed transfer gain and the efficiency and effectiveness of our mathematical programming formulation compared with TAG (Fifty et al., 2021).

In summary, this work makes several key contributions in the realm of task grouping:

• We propose a theoretically principled approach to constructing transfer gains without relying on restrictive assumptions;
• To solve task grouping problems, we introduce a mathematically generic and flexible programming formulation, capable of seamlessly integrating various budget constraints;
• Through extensive experiments, we demonstrate the effectiveness of our task grouping strategy across diverse domains. Furthermore, we empirically showcase the flexibility of our mathematical programming approach by addressing realistic constraints.

2 Related Works

Task Grouping. Task grouping in multitask learning involves organizing tasks into groups based on their inherent relationships or similarities. The idea is to exploit shared knowledge within each group to improve overall learning efficiency. Most early works utilized normalization terms to partition the model parameters to align with the grouping of tasks (Kang et al., 2011; Kumar & Daume III, 2012; Lee et al., 2016). Lee et al. (2018) extend the setting of Lee et al. (2016) to deep learning, aiming to model the asymmetric task relationships by an autoencoder. Zamir et al. (2018) presented “Taskonomy”, which disentangles task relationships and provides a more refined methodology for task grouping based on transfer learning hierarchies. Dwivedi & Roig (2019) introduced a novel approach to task taxonomy in multitask learning through representation similarity analysis, showing effectiveness and efficiency for generating task relationships on the Taskonomy dataset. Malhotra et al. (2022) contributed to the task grouping discourse by introducing scheduled task mitigation, which dynamically sequences task learning for improved multitask performance. Closely related to our work, Standley et al. (2020) and Fifty et al. (2021) apply a two-stage methodology to solve task grouping problems: first collecting the training information and defining task affinities, then using a Branch and Bound algorithm to search for the best grouping results. Among more recent works, Song et al. (2022) tried to solve task grouping problems through a meta-learning approach which creates a meta-learner to estimate task grouping gains. Sherif et al. (2023) proposed a data-driven method based on Data Maps (Swayamdipta et al., 2020) to capture task similarity on the fly, which can relieve the scalability issue.

Lookahead Methods. The philosophy of Lookahead methods is using future information to guide the current state, which has been widely used in meta-learning (Finn et al., 2017; Nichol et al., 2018; Wang et al., 2020b), multitask learning (Fifty et al., 2021) and optimization techniques (Zhang et al., 2019a; Wang et al., 2020a; Zhou et al., 2021; Byun et al., 2022). In particular for multitask learning, Fifty et al. (2021) collected the one-step-forward loss information between task pairs for each gradient updating and constructed the overall task affinity matrix at the end of training. We follow this idea but construct a more principled lookahead-based metric.

Loss Balance. Numerous works have emerged to address multitask learning by exploring the balance on the losses from different tasks (Mao et al., 2021; Yu et al., 2020; Javaloy & Valera, 2022; Navon et al., 2022; Kendall et al., 2018; Liu et al., 2021a, b; Guangyuan et al., 2022; Liu et al., 2022). In these works, various loss reweighing mechanisms are designed to dynamically balance the importance of each task, which can relieve the negative transfer among tasks in terms of gradient information. Being plug-and-play techniques, loss balance methods can be flexibly applied to existing multitask learning frameworks.

3 Preliminary

Definition 3.1.

(Multitask Learning) Consider a set of tasks $\mathcal{T}=\{T_{i},i\in[n]\}$ , where $[n]=\{1,2,...,n\}$ and each task $T_{i}$ is associated with a learning objective $L_{i}(\phi,\theta_{i})$ , where $\phi\in\mathbb{R}^{p}$ represents the shared parameters across all tasks, and $\theta_{i}\in\mathbb{R}^{p_{i}}$ denotes the task-specific parameters for task $T_{i}$ . The objective is to jointly optimize a composite loss function that combines the individual task losses, allowing the model to learn salient information across tasks. The multitask objective can be formulated as:

$$\phi^{*},\theta_{1}^{*},\ldots,\theta_{n}^{*}=\arg\min_{\phi,\theta_{1},\ldots,\theta_{n}}\sum_{i=1}^{n}\lambda_{i}L_{i}(\phi,\theta_{i})$$

where $\lambda_{i}$ is a task-specific weighting parameter.

In contrast to the conventional approach of obtaining optimal parameters directly from the MTL objective, this work diverges by leveraging the relationships inherent in the tasks observed during the training process. Rather than pursuing the singular optimization of task-specific parameters, this methodology capitalizes on the insights derived from task relationships. This leads to the concept of “Task Grouping”, where tasks are organized based on their observed relationships.

Definition 3.2.

(Task Grouping) Let $\mathcal{T}=\{T_{i},i\in[n]\}$ denote the set of tasks, and $G=\{G_{j},j\in[m]\}$ represent the set of task groups. Task grouping endeavors to establish a mapping based on task relationships such that for every task $T_{i}$, there exists a group $G_{j}$ to which $T_{i}$ is assigned, ensuring the inclusion of at least one task in each group and resulting in the best performance for each task.

The subsequent step involves obtaining optimal parameters at the level of task groups. The objective is:

$$\min_{\phi,\{\theta_{i}\}}\sum_{i\in G_{j}}\lambda_{i}L_{i}(\phi,\theta_{i})\quad\forall G_{j}\in G, \tag{1}$$

to enhance efficiency and effectiveness by jointly optimizing parameters within these groups.

4 Method

To infer task groupings for the subsequent optimization in Equation (1), we introduce a methodology for constructing transfer gains in Section 4.1, whose theoretical properties hold without relying on underlying assumptions. Subsequently, we propose a versatile mathematical programming framework in Section 4.2 to flexibly accommodate various budget constraints. This formulation is instrumental in deriving the outcomes of task grouping. Furthermore, in Section 4.3, we conduct a detailed analysis of the computational complexity associated with collecting transfer gains, in comparison to TAG (Fifty et al., 2021).

4.1 Assumption-Free Transfer Gain

In this subsection, we first introduce the concept of proposed transfer gain, which is a pivotal component of our methodology.

Definition 4.1.

(Transfer Gain) For task $T_{i}\neq T_{j}$ , the task transfer gain from $T_{i}$ to $T_{j}$ at training step $t$ is characterized by

$$\mathcal{S}_{i\rightarrow j}^{t}=1-\frac{L_{j}\left(\phi_{\{i,j\}}^{t+1},\theta_{j}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)} \tag{2}$$

In this equation, $L_{j}$ represents a task-specific metric, which can be, for example, the loss function or validation accuracy. $\phi_{\{i,j\}}^{t+1}$ and $\theta_{j}^{t+1}$ denote the shared and task-specific parameters after one further training iteration on $T_{i}$ and $T_{j}$ jointly, while $\phi_{\{j\}}^{t+1}$ denotes the shared parameters after one iteration on $T_{j}$ alone. Then we can define the group transfer gain from any $A\subseteq\mathcal{T}$ to task $T_{j}$ by

$$\mathcal{S}_{A\rightarrow j}^{t}=1-\frac{L_{j}\left(\phi_{A\cup\{j\}}^{t+1},\theta_{j}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)} \tag{3}$$

Furthermore, we extend this concept to the group transfer gain from any $A\subseteq\mathcal{T}$ to $B\subseteq\mathcal{T}$ by

$$\mathcal{S}_{A\rightarrow B}^{t}=\sum_{j\in B}\mathcal{S}_{A\rightarrow j}^{t}, \tag{4}$$

which allows us to measure the collective transfer knowledge between groups of tasks.
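To make Equations (2) to (4) concrete, the following is a minimal sketch on a toy problem. It assumes one scalar shared parameter, one scalar task-specific parameter per task, a quadratic loss with a constant offset, and plain gradient descent on the unweighted sum of losses as the one-step lookahead. All function names and the toy loss are hypothetical illustrations and are not taken from the paper's implementation.

```python
def loss(phi, theta, target):
    # Toy task loss (assumption); the +0.1 offset keeps it away from zero.
    return (phi + theta - target) ** 2 + 0.1


def grad(phi, theta, target):
    # For this toy loss, d/dphi and d/dtheta are identical.
    return 2.0 * (phi + theta - target)


def lookahead_phi(phi, thetas, targets, tasks, eta):
    # One gradient step on the shared parameter using the summed loss of `tasks`.
    g = sum(grad(phi, thetas[i], targets[i]) for i in tasks)
    return phi - eta * g


def transfer_gain(phi, thetas, targets, source, j, eta):
    """S_{A->j}^t of Eq. (3); Eq. (2) is the special case source = {i}."""
    theta_j_next = thetas[j] - eta * grad(phi, thetas[j], targets[j])
    phi_joint = lookahead_phi(phi, thetas, targets, set(source) | {j}, eta)
    phi_alone = lookahead_phi(phi, thetas, targets, {j}, eta)
    joint = loss(phi_joint, theta_j_next, targets[j])
    alone = loss(phi_alone, theta_j_next, targets[j])
    return 1.0 - joint / alone


def group_gain(phi, thetas, targets, A, B, eta):
    """S_{A->B}^t of Eq. (4): sum of the gains from A to each task in B."""
    return sum(transfer_gain(phi, thetas, targets, A, j, eta) for j in B)


# Tasks 0 and 1 share a target, task 2 pulls the shared parameter the other way.
phi, thetas, targets, eta = 0.0, [0.0, 0.0, 0.0], [1.0, 1.0, -1.0], 0.1
helpful = transfer_gain(phi, thetas, targets, {0}, 1, eta)   # positive
harmful = transfer_gain(phi, thetas, targets, {2}, 1, eta)   # negative
```

In this toy setting a positive value means that adding the source tasks lowers the lookahead loss of $T_{j}$ relative to training $T_{j}$ alone, and a negative value means the opposite.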

While our formulation in Equation (2) may bear a resemblance to prior work by Fifty et al. (2021), we demonstrate its advantages in the subsequent discussion. First, Fifty et al. (2021) define the task affinity as $\mathcal{Z}_{i\rightarrow j}^{t}=1-\frac{L_{j}\left(\theta_{s\mid i}^{t+1},\theta_{j}^{t}\right)}{L_{j}\left(\phi^{t},\theta_{j}^{t}\right)}$ w.r.t. the loss function, reflecting the effects of training $T_{i}$ on $T_{j}$, while $\mathcal{S}^{t}_{i\rightarrow j}$ measures the effects of training $T_{i}$ on training $T_{j}$. This distinction allows us to establish the relationship between loss decrease and the value of $\mathcal{S}^{t}_{i\rightarrow j}$ without additional assumptions. Specifically, both gains share the same denominator $L_{j}(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1})$, so if $\mathcal{S}^{t}_{i\rightarrow j}>\mathcal{S}^{t}_{k\rightarrow j}$, then training $\{T_{i},T_{j}\}$ results in more loss decrease than training $\{T_{k},T_{j}\}$, as further summarized in the following observation:

Observation 4.2.

If $\mathcal{S}_{A_{1}\rightarrow i}^{t}>\mathcal{S}_{A_{2}\rightarrow i}^{t}$ , then training task group $A_{1}$ induces larger loss decrease than $A_{2}$ for task $i$ .

In contrast, Fifty et al. (2021) introduce restrictive assumptions, such as strong convexity of the loss functions, to enforce this relationship. Secondly, in Fifty et al. (2021), group transfer gain is formulated as $\mathcal{Z}_{\{j,k\}\rightarrow i}^{t}=\frac{1}{2}\left(\mathcal{Z}_{j\rightarrow i}^{t}+\mathcal{Z}_{k\rightarrow i}^{t}\right)$. This formulation directly defines the group transfer gain at the task level, without a theoretical guarantee of its effectiveness. In this work, we establish the connection between task and group transfer gain in Proposition 4.3, leading to both theoretical advantages and valid empirical operations at the implementation level.

Proposition 4.3.

Assume $\forall i$ , $L_{i}$ is an L-Lipschitz loss function and $0<C\leq L_{j}$ , then we have:

$$\left|\mathcal{S}_{A\rightarrow j}^{t}-\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}\right|\leq\frac{\eta_{t}(1+|A|)L^{2}}{C}$$

where $\eta_{t}$ is the learning rate at time step $t$.

We leave the proof in Appendix A.

Remark 4.4.

In practical scenarios, the magnitude of the task-specific loss function $L_{j}$ resides within a reasonable range sufficiently distant from zero. This can be achieved, for instance, through regularization terms or early stopping mechanisms, under which the condition $0<C\leq L_{j},\forall j$ can be met. Additionally, the learning rate parameter $\eta_{t}$ is often set to small and gradually decreasing values during the training process. Furthermore, the group size $|A|$ can usually be bounded by a constant of reasonable magnitude, since it is often dictated by memory constraints or other practical considerations. Consequently, the bound in Proposition 4.3 is small in practice, and it is reasonable to approximate the group transfer gain by the average of the pairwise transfer gains.

Based on these facts, $\mathcal{S}_{i\rightarrow j}^{t}$ can serve as a measure of the intrinsic signals that reveal the relationship between tasks during training. Furthermore, we introduce the concept of cumulative transfer gain and group transfer across the training as follows:

$$\mathcal{S}_{i\rightarrow j}=\frac{1}{T}\sum_{t=1}^{T}\mathcal{S}_{i\rightarrow j}^{t},\quad\mathcal{S}_{A\rightarrow j}=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j} \tag{5}$$

to quantify the effects of training group $A$ on task $j$ across the training.
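The two averages in Equation (5) can be sketched as follows. The sketch assumes that the per-step gains have already been collected as a list of $n\times n$ nested lists, with entry `[i][j]` holding $\mathcal{S}_{i\rightarrow j}^{t}$; the function names and the sample numbers are hypothetical.

```python
def cumulative_gain(per_step):
    """Average per-step gain matrices S^t over T steps (Eq. 5, left)."""
    T = len(per_step)
    n = len(per_step[0])
    return [
        [sum(S_t[i][j] for S_t in per_step) / T for j in range(n)]
        for i in range(n)
    ]


def group_to_task_gain(S, A, j):
    """Approximate S_{A->j} by the mean of pairwise gains (Eq. 5, right)."""
    A = list(A)
    return sum(S[i][j] for i in A) / len(A)


# Two recorded steps for two tasks (illustrative numbers only).
steps = [
    [[0.0, 0.2], [0.4, 0.0]],
    [[0.0, 0.4], [0.0, 0.0]],
]
S = cumulative_gain(steps)
gain_from_0_to_1 = group_to_task_gain(S, [0], 1)
```

Storing only the pairwise matrix keeps the bookkeeping at $n^{2}$ entries, and any group gain is then read off by averaging, which is the approximation justified by Proposition 4.3.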

4.2 Generic and Flexible Task Grouping Framework

After obtaining the cumulative transfer gain across the training from Equation (5), the remaining question is how to accurately determine the task grouping results. Practical applications often require customization based on various requirements, such as resource constraints. Therefore, it is crucial to formulate a general and flexible task grouping framework. In contrast to previous approaches (Zamir et al., 2018; Fifty et al., 2021; Standley et al., 2020) that utilize Binary Integer Programming and Branch & Bound methods, we propose a mathematical programming framework which can be transformed into a Mixed Integer Programming formulation. This involves setting binary variables $X_{ij}$, where $i\in[n]$ and $j\in[m]$, with $X_{ij}=1$ indicating the assignment of task $i$ to group $j$, and $X_{ij}=0$ otherwise. Notably, $X_{\cdot j}\in\mathbb{R}^{n}$ indicates the $j$-th column of $X$, $\mathbf{1}$ represents a vector composed entirely of ones with an adaptable dimension, element $B_{ij}$ in $B\in\mathbb{R}^{n\times m}$ denotes the budget of $T_{i}$ assigned to $G_{j}$, $\mathbf{b}\in\mathbb{R}^{m}$ represents the maximum budget for each group, and “$\odot$” signifies the element-wise product between matrices, resulting in the following formulation:

$$\begin{aligned}
\max_{X}\quad & \sum_{j=1}^{m}\frac{1}{\mathbf{1}^{T}X_{\cdot j}}X_{\cdot j}^{T}\mathcal{S}X_{\cdot j} \\
\text{s.t.}\quad & X^{T}\mathbf{1}\geq\mathbf{1} \\
& X\mathbf{1}\geq\mathbf{1} \\
& (B\odot X)^{T}\mathbf{1}\leq\mathbf{b} \\
& \|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1,\quad j_{1}\neq j_{2} \\
& X\in\{0,1\}^{n\times m}
\end{aligned} \tag{6}$$

The objective of the task grouping problem is the aggregate impact of training each individual group. The quadratic form $X_{\cdot j}^{T}\mathcal{S}X_{\cdot j}$, normalized by the group size $\mathbf{1}^{T}X_{\cdot j}$, arises from the approximation stated in Proposition 4.3. Several key constraints are imposed to ensure a meaningful solution. The constraint ${X^{T}\mathbf{1}\geq\mathbf{1}}$ guarantees that each group contains at least one task, while the constraint ${X\mathbf{1}\geq\mathbf{1}}$ ensures that every task is assigned to at least one group. Additionally, diverse budgetary constraints can be introduced through the inequality $(B\odot X)^{T}\mathbf{1}\leq\mathbf{b}$. Moreover, the resulting groups must remain distinct, which is ensured by the condition $\|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1$. With appropriate transformations, as described in Appendix B, these problems can be solved efficiently by modern mathematical programming solvers.
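For very small instances, Formulation (6) can be checked by exhaustive enumeration. The sketch below is only such a sanity check and is not the Mixed Integer Programming transformation of Appendix B. It assumes $\mathcal{S}$ is given as a nested list with zero diagonal, and it simplifies the budget constraint to one cost per task and a single cap shared by all groups; `group_score`, `best_grouping`, `cost`, and `cap` are hypothetical names.

```python
from itertools import combinations


def group_score(S, group):
    # X_j^T S X_j / (1^T X_j): sum of gains inside the group over its size.
    return sum(S[i][k] for i in group for k in group) / len(group)


def best_grouping(S, m, cost=None, cap=None):
    """Enumerate m distinct non-empty groups covering all n tasks."""
    n = len(S)
    subsets = [c for r in range(1, n + 1) for c in combinations(range(n), r)]
    if cost is not None and cap is not None:
        # Simplified budget constraint: total task cost per group <= cap.
        subsets = [g for g in subsets if sum(cost[i] for i in g) <= cap]
    best, best_val = None, float("-inf")
    for groups in combinations(subsets, m):  # groups are pairwise distinct
        if len(set().union(*groups)) < n:    # every task must be covered
            continue
        val = sum(group_score(S, g) for g in groups)
        if val > best_val:
            best, best_val = groups, val
    return best, best_val


# Tasks 0 and 1 help each other, as do 2 and 3; cross pairs hurt.
S = [
    [0.0, 0.5, -0.5, -0.5],
    [0.5, 0.0, -0.5, -0.5],
    [-0.5, -0.5, 0.0, 0.5],
    [-0.5, -0.5, 0.5, 0.0],
]
groups, value = best_grouping(S, m=2)
```

The enumeration grows as the number of ways to choose $m$ subsets out of $2^{n}-1$, so it is only practical for a handful of tasks, which is one reason to hand the problem to a solver instead.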

4.3 Discussion on the Computational Complexity

In this part, we discuss the computational complexity of the proposed method for collecting transfer gains and compare it to the most related work, TAG (Fifty et al., 2021). We begin by defining the relevant notations: For each task $T_{i}$ , the feed-forward and backward computational costs of the neural networks are denoted as $\mathcal{F}_{i}$ and $\mathcal{B}_{i}$ , respectively. The average computation costs for these tasks are represented by $\mathcal{F}$ and $\mathcal{B}$ , where $\mathcal{B}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{B}_{i}$ and $\mathcal{F}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{F}_{i}$ . Based on these notations, the computation costs for TAG and our proposed method are $(n^{2}+n)\mathcal{F}+n\mathcal{B}$ and $(2n^{2}+n)\mathcal{F}+n\mathcal{B}$ , respectively. Both methods have the same order of computational complexity with respect to the number of tasks. While our method involves an additional $n^{2}$ feedforward computations, it’s worth emphasizing that feedforward costs are typically substantially smaller than backward costs in practice. Consequently, the impact of these additional feedforward computations is not significant, especially when considering the theoretical advantages our method offers. Moreover, we demonstrate in Appendix D that employing a lazy collection strategy for transfer gains leads to improved performance. Additionally, the analysis reveals that the cumulative transfer gain varies across different training phases, yielding diverse outcomes. These insights allow us to substantially reduce computational costs.
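The two cost expressions above can be compared numerically with a short sketch. The function names are hypothetical, and the ratio between backward and forward cost used in the example is an assumption chosen only for illustration.

```python
def tag_cost(n, F, B):
    # (n^2 + n) forward passes plus n backward passes, as stated for TAG.
    return (n**2 + n) * F + n * B


def ours_cost(n, F, B):
    # (2n^2 + n) forward passes plus n backward passes, as stated for our method.
    return (2 * n**2 + n) * F + n * B


# Example with n = 9 tasks, unit forward cost, and an assumed backward cost of 2.
n, F, B = 9, 1.0, 2.0
extra = ours_cost(n, F, B) - tag_cost(n, F, B)   # equals n^2 * F
ratio = ours_cost(n, F, B) / tag_cost(n, F, B)
```

The difference is always $n^{2}\mathcal{F}$, so the ratio between the two costs stays below 2 for any positive $\mathcal{F}$ and $\mathcal{B}$.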

5 Experiments

Table 1: Comparative Demonstration of Single-Task Training (STL), Multi-Task Learning (MTL), and Task Grouping Approaches on the COP Benchmark. The table highlights the following representations: STL’s strong baseline in orange, the optimal baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the optimal baseline) in bold.

Category	Method	TSP20	TSP50	CVRP20	CVRP50	OP20	OP50	Tot. Gap
	STL	$0.017\%$	$0.277\%$	$0.534\%$	$1.780\%$	$-0.849\%$	$1.117\%$	$2.876\%$
MTL	Naive-MTL	$0.022\%$	$0.469\%$	$0.522\%$	$2.070\%$	$-0.805\%$	$1.270\%$	$3.548\%$
	Bandit-MTL	$0.021\%$	$0.882\%$	$0.690\%$	$2.511\%$	$-0.865\%$	$2.114\%$	$5.354\%$
	PCGrad	$0.028\%$	$0.708\%$	$0.605\%$	$2.411\%$	$-0.689\%$	$1.756\%$	$4.819\%$
	UW	$0.042\%$	$0.362\%$	$0.412\%$	$1.703\%$	$-0.665\%$	$1.153\%$	$3.007\%$
	LS	$0.020\%$	$0.476\%$	$0.512\%$	$2.084\%$	$-0.792\%$	$1.197\%$	$3.498\%$
	Nash-MTL	$0.038\%$	$0.322\%$	$0.421\%$	$1.847\%$	$-0.873\%$	$1.279\%$	$3.034\%$
Task Grouping	Optimal-3	$0.022\%$	$0.307\%$	$0.404\%$	$1.692\%$	$-0.861\%$	$0.929\%$	$2.492\%$
	Random-3	$0.031\%$	$0.535\%$	$0.416\%$	$1.741\%$	$-0.764\%$	$1.279\%$	$3.237\%$
	TAG-3	$0.021\%$	$0.324\%$	$0.404\%$	$1.692\%$	$-0.771\%$	$1.104\%$	$\mathbf{2.774}\%$
	Ours-3	$0.021\%$	$0.324\%$	$0.404\%$	$1.692\%$	$-0.771\%$	$1.104\%$	$\mathbf{2.774}\%$
	Optimal-4	$0.027\%$	$0.277\%$	$0.404\%$	$1.692\%$	$-0.897\%$	$0.929\%$	$2.431\%$
	Random-4	$0.024\%$	$0.473\%$	$0.681\%$	$2.125\%$	$-0.852\%$	$1.072\%$	$3.522\%$
	TAG-4	$0.021\%$	$0.324\%$	$0.504\%$	$2.028\%$	$-0.780\%$	$1.319\%$	$2.757\%$
	Ours-4	$0.017\%$	$0.277\%$	$0.404\%$	$1.692\%$	$-0.771\%$	$1.104\%$	$\mathbf{2.723}\%$
	Optimal-5	$0.017\%$	$0.277\%$	$0.404\%$	$1.692\%$	$-0.897\%$	$0.929\%$	$2.422\%$
	Random-5	$0.038\%$	$0.444\%$	$0.472\%$	$1.992\%$	$-0.761\%$	$1.233\%$	$3.419\%$
	TAG-5	$0.021\%$	$0.307\%$	$0.404\%$	$1.692\%$	$-0.771\%$	$1.104\%$	$2.757\%$
	Ours-5	$0.021\%$	$0.324\%$	$0.404\%$	$1.692\%$	$-0.840\%$	$1.095\%$	$\mathbf{2.696}\%$

In this section, we present the experimental evaluation of our novel task grouping approach, demonstrating its effectiveness across diverse domains. Our experiments are strategically designed to validate the versatility and superiority of our method in three distinct scenarios: (1) Computer vision tasks on the CelebA dataset; (2) Combinatorial optimization benchmarks from Wang & Yu (2023); and (3) Time series tasks following Song et al. (2022). The metrics utilized across all scenarios are calculated based on the best outcomes for each task within its respective groups.

Baselines Our experimental evaluation involves a comprehensive comparison against a range of established methods. These baselines are categorized into three distinct types, each representing a different approach to task learning and organization in MTL: (1) Single Task Learning (STL); (2) MTL methods: We consider a variety of MTL methods that employ different strategies for joint task learning, including: Naive-MTL, Bandit-MTL (Mao et al., 2021), PCGrad (Yu et al., 2020), Nash-MTL (Navon et al., 2022), Uncertainty-Weighting (UW) (Kendall et al., 2018) and LinearScale; (3) Task Grouping Methods: This category includes methods that explicitly group tasks based on certain criteria or strategies. The common task grouping baselines are: Random Policy: Tasks are grouped randomly and results are averaged over 10 repeats; Optimal Policy: Obtained by enumeration, providing an upper benchmark for performance; TAG (Fifty et al., 2021): A known state-of-the-art task grouping method that groups tasks based on their affinity. Specific differences and adaptations of these baselines will be detailed in the corresponding sections for each scenario.

5.1 Computer Vision Tasks

In this part, we investigate the implementation of our task grouping approach in the field of computer vision, specifically using the CelebA dataset for our analysis. The effectiveness of our method is quantified by measuring the total classification error rates across all tasks. To ensure experimental consistency, we adhere to the network architecture and training hyperparameters specified in TAG (Fifty et al., 2021), employing the same 9 tasks as in TAG. In addition, we include several further baselines for a more thorough comparative evaluation, such as High Order Approximation (HOA) (Standley et al., 2020), cosine similarity (CS), and the optimal baseline (OPT) as used in TAG.

Figure 0(a) offers a comprehensive evaluation of various baselines in terms of the Total Error metric under different group split conditions. It is observed that naive multi-task learning exhibits superior performance compared to single-task learning, and most grouping methods result in a further reduction in total error. This indicates the presence of complex underlying relationships between the tasks in this dataset, suggesting that our grouping methods can identify specific combinations that enhance overall performance. Our method consistently outperforms all baselines in terms of Total Error across all splits, with the only exception being the ‘OPT’ strategy in TAG (Fifty et al., 2021). Moreover, our method effectively leverages the granularity afforded by increased splits, demonstrating a decrease in total error with a greater number of splits, as elaborated in Appendix C.

5.2 Combinatorial Optimization Benchmarks

In this section, we focus on combinatorial optimization, specifically testing six tasks: TSP20, TSP50, CVRP20, CVRP50, OP20, and OP50. These tasks encompass various scales of the Traveling Salesman Problem (TSP), Capacitated Vehicle Routing Problem (CVRP), and Orienteering Problem (OP), offering a wide spectrum of combinatorial challenges. The neural solver used for these tasks is the POMO (Kwon et al., 2020) framework, noted for its effectiveness in addressing combinatorial optimization problems. Consistency in experimental conditions is maintained by aligning the training parameters with those outlined in the POMO study. Performance evaluation is centered on the average optimality gap, defined mathematically as:

$$\text{Gap}=\frac{1}{N}\sum_{i=1}^{N}\left(1-\frac{\text{solver}(\mathcal{I}_{i})}{\text{gt}(\mathcal{I}_{i})}\right)\times 100\%$$

evaluated over $N=10,000$ instances for each task, to measure the solution’s deviation from the ground truth obtained from Gurobi (Gurobi Optimization, LLC, 2023), with results detailed in Table 1.
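The gap computation can be sketched in a few lines. The function name and the `minimize` flag are hypothetical: as written, the expression is positive when the solver value is below the reference, so the sketch assumes the sign is flipped for minimization tasks such as TSP and CVRP. That convention is an assumption and is not stated in the text.

```python
def average_gap(solver_values, gt_values, minimize=False):
    """Average optimality gap in percent over paired instance values."""
    N = len(solver_values)
    # Assumption: flip the sign for minimization so a worse solver gives a positive gap.
    sign = -1.0 if minimize else 1.0
    total = sum(sign * (1.0 - s / g) for s, g in zip(solver_values, gt_values))
    return 100.0 * total / N


# Two minimization instances where the solver is 1% and 2% above the reference.
gap_min = average_gap([101.0, 102.0], [100.0, 100.0], minimize=True)
# Two maximization instances where the solver collects 95% and 100% of the reference.
gap_max = average_gap([95.0, 100.0], [100.0, 100.0])
```

Under either convention a negative gap indicates that the learned solver produced a better objective value than the reference on average, which is how negative entries such as those in the OP20 column can arise.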

In the presented comparative analysis in Table 1, Single-Task Learning (STL) demonstrates a robust baseline, outperforming all Multi-Task Learning (MTL) methods with respect to the Total Gap metric. Within the domain of Task Grouping methods, both TAG and our proposed method have shown the capability to surpass the STL baseline in certain aspects. Notably, our method consistently achieves the best performance among non-optimal baselines across each grouping strategy, indicating its efficacy in handling multiple related tasks simultaneously. As we consider the performance trends across different task groupings, it is observed that the efficacy of Optimal, TAG, and our method improves as the number of splits becomes larger. This trend suggests that the tasks within the COP benchmark exhibit high positive transfer potential.

In the context of 3-split task grouping, both TAG and our method achieve logical groupings of tasks, pairing (TSP20, TSP50), (CVRP20, CVRP50), and (OP20, OP50) together. This reflects an intuitive understanding that tasks of the same types benefit from being trained in concert. Intriguingly, the optimal groupings, (TSP20, TSP50, CVRP20), (CVRP20, CVRP50) and (TSP50, CVRP20, OP20, OP50), do not align with these intuitive pairings, suggesting that there may be non-obvious correlations that, when leveraged, could lead to even greater improvements in task performance.

5.3 Time Series Tasks

In this section, the focus is on time series forecasting using the ETTm1 dataset (Wu et al., 2021), comprising seven distinct electric load time series. Following the methodology of Song et al. (2022), each series within the ETTm1 dataset is treated as an individual forecasting task. Performance is assessed using the relative reduction in mean absolute error (MAE) as the evaluation metric. This study employs a model architecture based on the AutoFormer framework (Wu et al., 2021), specifically designed for the intricacies and predictive challenges of multi-variate time series data.

Figure 0(b) shows that all methods exhibit significant variance across experiments, likely attributable to the inherent complexity of time series forecasting. This uncertainty makes it harder to collect precise transfer gains. Nevertheless, our method surpasses STL, MTL, and TAG in these demanding conditions, with TAG’s performance equivalent to random task grouping. Moreover, our method demonstrates the ability to produce groupings with reduced errors when using a greater number of splits, as elaborated in Appendix C.

5.4 Constraint-Aware Task Grouping

In practical scenarios, constraints like limited computational resources, data availability, and group size requirements are prevalent. Tasks often have different data acquisition costs, necessitating budget management. In distributed learning, with resources spread across nodes, adhering to group size limits is essential. Our mathematical programming approach in Formulation (6) effectively tackles the task grouping challenge under these constraints, incorporating customized limitations to meet the practical demands of real-world applications. In this section, we address the constraint that group sizes must fall within a specified range, as shown in:

$$\mathbf{m_{1}}\leq X^{T}\mathbf{1}\leq\mathbf{m_{2}}, \tag{7}$$

where $\mathbf{m_{1}},\mathbf{m_{2}}\in\mathbb{R}^{m}$ , applying element-wise comparison. Our experiments across computer vision, combinatorial optimization (COP), and time series scenarios demonstrate the model’s capability to adhere to these size constraints. We benchmark our method against random sampling, conducted 10 times under the same constraints, detailing the experimental setup subsequently.
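Constraint (7) is easy to verify for a candidate assignment. The sketch below assumes $X$ is stored as an $n\times m$ nested list of 0/1 entries, with rows indexing tasks and columns indexing groups; the function names are hypothetical.

```python
def group_sizes(X):
    # X^T 1: column sums of the 0/1 assignment matrix.
    return [sum(col) for col in zip(*X)]


def satisfies_size_bounds(X, m1, m2):
    # Element-wise check of m1 <= X^T 1 <= m2.
    return all(lo <= s <= hi for lo, s, hi in zip(m1, group_sizes(X), m2))


# Three tasks, two groups: tasks 0 and 1 in group 0, task 2 in group 1.
X = [
    [1, 0],
    [1, 0],
    [0, 1],
]
ok = satisfies_size_bounds(X, m1=[1, 1], m2=[2, 2])
too_tight = satisfies_size_bounds(X, m1=[1, 1], m2=[1, 1])
```

Because the bounds are linear in $X$, they can be appended to Formulation (6) without changing the type of the optimization problem.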

Maximum Group Size Constraint For this constraint, we define $\mathbf{m_{1}}$ as $\mathbf{1}$ and $\mathbf{m_{2}}$ as $M\mathbf{1}$, where $M$ represents the maximum permissible group size, dictated by memory limitations. In the context of the CelebA dataset, which comprises 9 tasks, we set $M$ within the range $\{5,6,7,8,9\}$. For the COP benchmark involving 6 tasks, $M$ is selected from $\{3,4,5,6\}$, and for time series tasks comprising 7 tasks, we choose $M$ from $\{4,5,6,7\}$.

The experimental results, illustrated in Figure 2(a), exhibit a consistent trend across cases and group sizes. For example, on the CelebA dataset, our method consistently outperforms the random policy, with progressively lower Total Error as the maximum group size increases. Similarly, on the COP benchmarks and time series tasks with their respective maximum group size constraints, our method maintains or even improves its performance. This consistency across maximum group sizes and tasks underscores the robustness and adaptability of our approach under changing task constraints.

Minimum Group Size Constraint In this case, the vectors $\mathbf{m_{1}}$ and $\mathbf{m_{2}}$ are defined as $m\mathbf{1}$ and $M_{\text{max}}\mathbf{1}$ , respectively, where $M_{\text{max}}$ denotes the total number of tasks specific to each case: 9 for the CelebA dataset, 6 for COP, and 7 for time series tasks. This formulation encapsulates the requirement that device utilization at each node surpasses a certain threshold, ensuring an efficient usage rate. For all three scenarios, we set $m$ to be in the set $\{1,2,3,4\}$ .

Results are shown in Figure 2(b). On the CelebA dataset, as the minimum group size increases, our method consistently exhibits lower Total Error than the random policy, indicating that it handles larger group sizes more effectively. Similarly, on the COP benchmarks, the Total Gap percentage decreases as the group size grows, highlighting our method’s effectiveness under tighter constraints. While the performance of our method on the time series tasks under minimum group size constraints does not match the levels achieved on the CelebA and COP benchmarks, it still notably outperforms the random policy.
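One simple way to draw a random grouping that respects the size bounds is rejection sampling, sketched below. This is an illustrative assumption about how such a baseline can be built, not necessarily the exact sampling procedure used in our experiments; for simplicity it draws disjoint groups, and the function name, seed, and example setting are hypothetical.

```python
import random


def sample_random_grouping(n_tasks, n_groups, min_size, max_size, rng,
                           max_tries=10_000):
    """Rejection-sample a grouping under group-size bounds.

    Each task joins one group chosen uniformly at random; the draw is kept
    only if every group size lies in [min_size, max_size].
    """
    for _ in range(max_tries):
        groups = [[] for _ in range(n_groups)]
        for task in range(n_tasks):
            groups[rng.randrange(n_groups)].append(task)
        if all(min_size <= len(g) <= max_size for g in groups):
            return groups
    raise ValueError("no feasible grouping found; the bounds may be infeasible")


rng = random.Random(0)  # fixed seed, for reproducibility of the illustration
# Hypothetical setting: 9 tasks, 3 groups, group sizes between 2 and 5.
draws = [sample_random_grouping(9, 3, 2, 5, rng) for _ in range(10)]
for groups in draws[:3]:
    print(groups)
```

Averaging the test metric over the 10 draws then gives a random-policy reference under the same constraints.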

5.5 Ablation Study

In this section, we conduct an ablation study to underscore the significance of the novel transfer gain and the mathematical programming framework, in comparison to TAG (Fifty et al., 2021). To demonstrate the significance of our transfer gain metric, we apply the mathematical framework from Formulation 6 for task grouping using either $\mathcal{S}_{i\rightarrow j}^{t}$ as per formulation 4.1 or TAG’s affinity $\mathcal{Z}_{i\rightarrow j}^{t}$ , and term the two variants “Ours-MP” and “TAG-MP”, respectively. We also assess our framework’s performance against TAG’s branch and bound technique, introducing “Ours-BB” for the Branch & Bound method guided by the transfer gain $\mathcal{S}_{i\rightarrow j}^{t}$ .

Ablation on Transfer Gain
The first comparison concerns the transfer gain, $\mathcal{S}_{i\rightarrow j}^{t}$ versus $\mathcal{Z}_{i\rightarrow j}^{t}$ , as implemented in Ours-MP and TAG-MP, respectively. Results in Figure 3 reveal that Ours-MP consistently surpasses TAG-MP across all benchmarks. This is evidenced by its lower Total Error on the CelebA benchmark, lower Total Gap percentage on COP, and reduced Total MAE on ETTm1 as the number of grouping splits increases. TAG-MP’s declining performance, particularly on the CelebA dataset, suggests that, under the same task grouping solver, the transfer gain proposed in our methodology captures task relationships more accurately than the one proposed in TAG.

Ablation on Task Grouping Solver
In Figure 3, Ours-MP demonstrates better performance and computational efficiency, providing results across all grouping splits, while Ours-BB is computationally limited and cannot process more than a certain number of groupings, as highlighted by the gray dashed line for CelebA and ETTm1.

Table 2: Comparative Analysis of Time Efficiency Across Different Task Splits and Datasets, for CelebA with 9 tasks, COP with 6 tasks and ETTm1 with 7 tasks. “s”, “m” and “h” stand for seconds, minutes and hours, respectively. “-” indicates that the method fails to solve the problem within an 8-hour time limit.

	CelebA		COP		ETTm1
Splits	BB	MP	BB	MP	BB	MP
2	0.312s	0.907s	0.007s	0.085s	0.024s	0.122s
3	1.93m	2.255s	0.279s	0.713s	2.005s	1.078s
4	-	4.009s	5.955s	0.899s	1.63m	1.677s
5	-	7.261s	1.48m	1.621s	1.02h	2.218s
6	-	21.44s	16.96m	5.013s	-	4.158s

We report the detailed time cost of Ours-MP and Ours-BB in Table 2, abbreviated as MP and BB, respectively. On CelebA, MP solves every setting up to 6 splits within 30 seconds. On COP, the time cost of BB increases substantially, potentially exponentially, with the number of splits (from 0.007s at 2 splits to 16.96m at 6 splits), while on CelebA and ETTm1 BB fails to return a result within the 8-hour limit beyond 3 and 5 splits, respectively. The same trend is observable across datasets: for a fixed number of splits (2 or 3), the time cost of BB grows drastically with the number of tasks. This highlights our method’s scalability and robustness in managing increased computational demands across varying scenarios.
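A rough count helps explain why an enumeration-style search scales poorly. The sketch below assumes, purely for illustration, that the candidate groups are all $2^{n}-1$ non-empty task subsets and that a solution selects $m$ of them, which gives $\binom{2^{n}-1}{m}$ candidate selections before any pruning. This is an assumption about the size of the search space, not a description of how Ours-BB is implemented, and the function name is hypothetical.

```python
from math import comb


def candidate_selections(n_tasks, n_splits):
    """Number of ways to pick n_splits distinct non-empty task subsets."""
    n_subsets = 2 ** n_tasks - 1
    return comb(n_subsets, n_splits)


# Task counts follow Table 2: COP has 6 tasks, ETTm1 has 7, CelebA has 9.
for name, n_tasks in [("COP", 6), ("ETTm1", 7), ("CelebA", 9)]:
    counts = [candidate_selections(n_tasks, m) for m in range(2, 7)]
    print(name, counts)
```

Under this assumption the count grows exponentially in the number of tasks and polynomially (with a large base) in the number of splits, which matches the qualitative pattern of the BB columns in Table 2.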

6 Conclusions

In conclusion, this work introduces a principled approach to task grouping in Multi-Task Learning (MTL) that advances over existing methodologies in two respects: a robust construction of transfer gains that does not depend on restrictive assumptions, and a flexible mathematical programming formulation for solving task grouping problems. Empirical results across various fields show that our approach surpasses a wide range of baselines, not only in performance but also in flexibility and efficiency under real-world conditions. These results highlight the capacity of our method to markedly improve the applicability and performance of MTL models across diverse and resource-restricted environments.

7 Broader Impact

This research on Multi-Task Learning (MTL) presents a novel approach to task grouping that achieves significant efficiency gains in both academic and industrial settings. It stands out for its flexibility in adapting to diverse and realistic demands, which is crucial for managing complex tasks efficiently. This adaptability is particularly important given the growing computational demands of large-scale data analysis. However, the approach also raises ethical considerations. The interpretation of inter-task affinities, if not handled cautiously, could lead to incorrect associations or biases, especially in sensitive contexts. It is imperative to recognize and address these risks to prevent potential misuse and ensure the responsible application of this technology. Despite these considerations, the method’s ability to considerably reduce computational demands, while catering to specific requirements and maintaining high accuracy, is a noteworthy advancement in MTL.

References

Byun et al. (2022) Byun, H.-E., Kim, B., and Lee, J. H. Multi-step lookahead bayesian optimization with active learning using reinforcement learning and its application to data-driven batch-to-batch optimization. Computers & Chemical Engineering, 167:107987, 2022.
Caruana (1997) Caruana, R. Multitask learning. Machine learning, 28:41–75, 1997.
Ding et al. (2023) Ding, C., Lu, Z., Wang, S., Cheng, R., and Boddeti, V. N. Mitigating task interference in multi-task learning via explicit task routing with non-learnable primitives. CVPR, pp. 7756–7765, 2023.
Dwivedi & Roig (2019) Dwivedi, K. and Roig, G. Representation similarity analysis for efficient task taxonomy & transfer learning. In CVPR, pp. 12387–12396, 2019.
Fifty et al. (2021) Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., and Finn, C. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135, 2017.
Guangyuan et al. (2022) Guangyuan, S., Li, Q., Zhang, W., Chen, J., and Wu, X.-M. Recon: Reducing conflicting gradients from the root for multi-task learning. In ICLR, 2022.
Guo et al. (2020) Guo, P., Lee, C.-Y., and Ulbricht, D. Learning to branch for multi-task learning. In ICML, pp. 3854–3863, 2020.
Gurobi Optimization, LLC (2023) Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
Huang et al. (2022) Huang, Z., Rao, M., Raju, A., Zhang, Z., Bui, B., and Lee, C. Mtl-slt: multi-task learning for spoken language tasks. In Proceedings of the 4th Workshop on NLP for Conversational AI, pp. 120–130, 2022.
Javaloy & Valera (2022) Javaloy, A. and Valera, I. Rotograd: Gradient homogenization in multitask learning. In ICLR, 2022.
Kang et al. (2011) Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. In ICML, pp. 521–528, 2011.
Kendall et al. (2018) Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491, 2018.
Kumar & Daume III (2012) Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.
Kwon et al. (2020) Kwon, Y.-D., Choo, J., Kim, B., Yoon, I., Gwon, Y., and Min, S. Pomo: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems, 33:21188–21198, 2020.
Lee et al. (2016) Lee, G., Yang, E., and Hwang, S. Asymmetric multi-task learning based on task relatedness and loss. In International conference on machine learning, pp. 230–238. PMLR, 2016.
Lee et al. (2018) Lee, H. B., Yang, E., and Hwang, S. J. Deep asymmetric multi-task feature learning. In ICML, pp. 2956–2964, 2018.
Liu et al. (2021a) Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021a.
Liu et al. (2021b) Liu, L., Li, Y., Kuang, Z., Xue, J., Chen, Y., Yang, W., Liao, Q., and Zhang, W. Towards impartial multi-task learning. In ICLR, 2021b.
Liu et al. (2022) Liu, S., James, S., Davison, A. J., and Johns, E. Auto-lambda: Disentangling dynamic task relationships. Trans. Mach. Learn. Res., 2022, 2022.
Malhotra et al. (2022) Malhotra, A., Vatsa, M., and Singh, R. Dropped scheduled task: Mitigating negative transfer in multi-task learning using dynamic task dropping. Transactions on Machine Learning Research, 2022.
Mao et al. (2021) Mao, Y., Wang, Z., Liu, W., Lin, X., and Hu, W. Banditmtl: Bandit-based multi-task learning for text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5506–5516, 2021.
Navon et al. (2022) Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., and Fetaya, E. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022.
Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
Raychaudhuri et al. (2022) Raychaudhuri, D. S., Suh, Y., Schulter, S., Yu, X., Faraki, M., Roy-Chowdhury, A. K., and Chandraker, M. Controllable dynamic multi-task architectures. CVPR, pp. 10945–10954, 2022.
Sherif et al. (2023) Sherif, A., Abid, A., Elattar, M., and ElHelw, M. Stg-mtl: Scalable task grouping for multi-task learning using data map. arXiv preprint arXiv:2307.03374, 2023.
Song et al. (2022) Song, X., Zheng, S., Cao, W., Yu, J., and Bian, J. Efficient and effective multi-task grouping via meta learning on task combinations. Advances in Neural Information Processing Systems, 35:37647–37659, 2022.
Standley et al. (2020) Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., and Savarese, S. Which tasks should be learned together in multi-task learning? In ICML, pp. 9120–9132, 2020.
Swayamdipta et al. (2020) Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In EMNLP, pp. 9275–9293. Association for Computational Linguistics, 2020.
Vandenhende et al. (2021) Vandenhende, S., Georgoulis, S., Van Gansbeke, W., Proesmans, M., Dai, D., and Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021.
Wang & Yu (2023) Wang, C. and Yu, T. Efficient training of multi-task neural solver with multi-armed bandits. arXiv preprint arXiv:2305.06361, 2023.
Wang et al. (2024) Wang, C., Yu, Z., McAleer, S., Yu, T., and Yang, Y. Asp: Learn a universal neural solver! IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
Wang et al. (2020a) Wang, J., Tantia, V., Ballas, N., and Rabbat, M. Lookahead converges to stationary points of smooth non-convex functions. In ICASSP, pp. 8604–8608, 2020a.
Wang et al. (2020b) Wang, Z., Lipton, Z. C., and Tsvetkov, Y. On negative interference in multilingual models: Findings and a meta-learning treatment. arXiv preprint arXiv:2010.03017, 2020b.
Wu et al. (2021) Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
Yu et al. (2020) Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
Yue et al. (2023) Yue, Z., Zhang, Y., and Liang, J. Learning conflict-noticed architecture for multi-task learning. In AAAI Conference on Artificial Intelligence, 2023.
Zamir et al. (2018) Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In CVPR, pp. 3712–3722, 2018.
Zhang et al. (2022a) Zhang, L., Liu, X., and Guan, H. A tree-structured multi-task model recommender. In AutoML, 2022a.
Zhang et al. (2019a) Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019a.
Zhang & Yang (2021) Zhang, Y. and Yang, Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
Zhang et al. (2019b) Zhang, Z., Wu, B., and Schuller, B. Attention-augmented end-to-end multi-task learning for emotion prediction from speech. ICASSP, pp. 6705–6709, 2019b.
Zhang et al. (2022b) Zhang, Z., Yu, W., Yu, M., Guo, Z., and Jiang, M. A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods. arXiv preprint arXiv:2204.03508, 2022b.
Zhou et al. (2021) Zhou, P., Yan, H., Yuan, X., Feng, J., and Yan, S. Towards understanding why lookahead generalizes better than sgd and beyond. Advances in Neural Information Processing Systems, 34:27290–27304, 2021.

Appendix A Proof of Proposition 4.3

Proof.

Following the Taylor expansion:

		$\displaystyle\mathcal{S}_{A\rightarrow j}^{t}=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{A\cup\{j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\sum_{i\in A}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle=\frac{\sum_{i\in A}\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{\{i,j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\frac{\sum_{i\in A}\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)^{T}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\eta_{t}\frac{\sum_{i\in A}\left\|\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)\right\|\cdot\left\|\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})\right\|}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle\leq\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\frac{\eta_{t}(1+|A|)L^{2}}{C}$

∎

Appendix B Formulation Transformation

We show here how Formulation 6 is transformed into a Mixed-Integer Quadratic Program with non-linear constraints. We introduce a continuous variable $\mathbf{y}\in[0,1]^{n}$ and binary variables $Z_{ijk}$ to obtain:

$\displaystyle\max_{X,y}\quad$	$\displaystyle\sum_{j=1}^{m}{\sum_{k=1}^{n}\sum_{i=1}^{n}S_{ik}y_{j}Z_{ijk}}$	(8)
s.t.	$\displaystyle{X^{T}\mathbf{1}\geq\mathbf{1}}$
	$\displaystyle X\mathbf{1}\geq\mathbf{1}$
	$\displaystyle(B\odot X)^{T}\mathbf{1}\leq b$
	$\displaystyle X^{T}\mathbf{y}=\mathbf{1}$
	$\displaystyle Z_{ijk}=X_{ij}\cdot X_{kj}\ \forall i,j,k$
	$\displaystyle\|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1,\ j_{1}\neq j_{2}$
	$\displaystyle{X\in\{0,1\}^{n\times n}}$

which can be solved by classical solvers (we use Gurobi (Gurobi Optimization, LLC, 2023) in this work).
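To make the objective in (8) concrete, the following sketch evaluates it for an explicit grouping and finds the best grouping of a tiny instance by exhaustive enumeration. It rests on two assumptions that should be read as an interpretation rather than as our implementation: $y_{j}$ is taken to be the reciprocal of the size of group $j$, so each group contributes the size-normalized sum of $S_{ik}$ over its members, and the budget constraint $(B\odot X)^{T}\mathbf{1}\leq b$ is omitted. The enumeration stands in for the Gurobi solve and is only viable for a handful of tasks; the matrix `S` and all function names are hypothetical.

```python
from itertools import combinations


def grouping_objective(S, groups):
    """Sum over groups of the within-group gains S[i][k], divided by group size."""
    total = 0.0
    for group in groups:
        pair_sum = sum(S[i][k] for i in group for k in group)
        total += pair_sum / len(group)
    return total


def brute_force_grouping(S, n_groups):
    """Try every set of n_groups distinct non-empty subsets that covers all tasks."""
    n = len(S)
    subsets = [c for r in range(1, n + 1) for c in combinations(range(n), r)]
    best_value, best_groups = float("-inf"), None
    for groups in combinations(subsets, n_groups):
        covered = set().union(*groups)
        if len(covered) < n:  # each task must appear in at least one group
            continue
        value = grouping_objective(S, groups)
        if value > best_value:
            best_value, best_groups = value, groups
    return best_value, best_groups


# Hypothetical gains: tasks 0 and 1 help each other, as do tasks 2 and 3,
# while transfer across the two pairs is negative.
S = [
    [1.0, 0.5, -0.5, -0.5],
    [0.5, 1.0, -0.5, -0.5],
    [-0.5, -0.5, 1.0, 0.5],
    [-0.5, -0.5, 0.5, 1.0],
]
best_value, best_groups = brute_force_grouping(S, n_groups=2)
print(best_value, best_groups)  # 3.0 ((0, 1), (2, 3))
```

Choosing distinct subsets enforces the constraint $\|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1$, and the coverage test corresponds to $X\mathbf{1}\geq\mathbf{1}$.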

Appendix C Further Results on CelebA and ETTm1

This section provides a comprehensive overview of the experimental settings and results, including the model architectures employed, hyperparameters, benchmark methods, and modifications implemented to ensure a more equitable comparison. The experiments were conducted on a server equipped with 8 NVIDIA GeForce RTX 3090 GPUs and an AMD EPYC 7542 32-Core CPU. The primary software versions used are CUDA 11.8, TensorFlow 2.14.1, and PyTorch 2.1.2.

C.1 Computer Vision Tasks

In the computer vision tasks, the neural network is a ResNet16 encoder coupled with an MLP decoder for each task. Model structure hyperparameters and training attributes such as hidden dimensions, encoder layers, initial learning rate and scheduling method mirror those in (Fifty et al., 2021). We compare our approach against four benchmark MTL methodologies, namely Naive-MTL, GradNorm, PCGrad, and Uncertainty Weights. Notably, we directly adopt the exact groupings reported by (Fifty et al., 2021) as the grouping outcomes of the TAG, HOA, and CS methods, though the total error is based on our own test runs. As Table 3 shows, our method typically surpasses other grouping methodologies and MTL methods, barring the OPT groupings, which are derived from TAG. These may not necessarily represent the optimal groupings in our test results, as our approach outperforms OPT in the 2-split scenario.

The CelebA dataset, comprising over 200K celebrity images with 40 attribute annotations each, serves as a large-scale dataset for facial attributes. We select 9 attributes for our tasks: 5 o’Clock Shadow, Black Hair, Blond Hair, Brown Hair, Goatee, Mustache, No Beard, Rosy Cheeks, and Wearing Hat, aligning with the attributes employed in TAG.

Table 3: Comparison among our proposed method, single task training (STL) and multi-task learning (MTL) methods on the CelebA dataset. The table highlights the following representations: MTL’s strong baseline in orange, the OPT baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the OPT baseline) in bold.

	Method	a1	a2	a3	a4	a5	a6	a7	a8	a9	Total Error
	STL	$6.47\pm 0.044$	$11.27\pm 0.037$	$4.19\pm 0.006$	$12.29\pm 0.020$	$2.72\pm 0.038$	$3.12\pm 0.015$	$4.98\pm 0.031$	$4.85\pm 0.019$	$0.73\pm 0.007$	$50.62\pm 0.031$
MTL	Naive-MTL	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.19\pm 0.014$	$12.56\pm 0.101$	$2.58\pm 0.015$	$3.02\pm 0.027$	$4.80\pm 0.017$	$4.74\pm 0.034$	$0.70\pm 0.004$	$50.23\pm 0.150$
	GradNorm	$7.18\pm 0.071$	$11.35\pm 0.028$	$4.21\pm 0.034$	$12.18\pm 0.127$	$2.52\pm 0.015$	$2.85\pm 0.032$	$5.01\pm 0.084$	$5.29\pm 0.103$	$0.72\pm 0.019$	$51.31\pm 0.367$
	PCGrad	$6.58\pm 0.101$	$11.12\pm 0.019$	$4.27\pm 0.048$	$12.67\pm 0.722$	$2.61\pm 0.034$	$2.92\pm 0.010$	$4.96\pm 0.204$	$5.02\pm 0.021$	$0.69\pm 0.030$	$50.83\pm 0.583$
	UW	$6.72\pm 0.037$	$11.32\pm 0.019$	$4.30\pm 0.150$	$13.61\pm 0.541$	$2.74\pm 0.049$	$2.93\pm 0.051$	$5.43\pm 0.175$	$4.86\pm 0.040$	$0.79\pm 0.025$	$52.71\pm 0.528$
Task Grouping	OPT-2	$6.60\pm 0.017$	$11.16\pm 0.002$	$4.04\pm 0.028$	$12.05\pm 0.133$	$2.60\pm 0.010$	$2.88\pm 0.011$	$4.82\pm 0.037$	$4.71\pm 0.037$	$0.74\pm 0.025$	$49.58\pm 0.154$
	Random-2	$6.41\pm 0.031$	$11.10\pm 0.019$	$4.14\pm 0.009$	$12.51\pm 0.033$	$2.66\pm 0.007$	$2.92\pm 0.017$	$4.71\pm 0.027$	$4.75\pm 0.025$	$0.78\pm 0.011$	$49.99\pm 0.033$
	TAG-2	$6.50\pm 0.158$	$11.16\pm 0.053$	$4.24\pm 0.038$	$13.15\pm 0.112$	$2.66\pm 0.004$	$2.99\pm 0.006$	$4.88\pm 0.118$	$4.74\pm 0.010$	$0.71\pm 0.005$	$51.04\pm 0.305$
	HOA-2	$6.55\pm 0.018$	$11.27\pm 0.026$	$4.14\pm 0.009$	$11.87\pm 0.045$	$2.57\pm 0.019$	$3.03\pm 0.016$	$4.69\pm 0.024$	$4.81\pm 0.025$	$0.67\pm 0.006$	$49.60\pm 0.046$
	CS-2	$6.58\pm 0.028$	$11.15\pm 0.027$	$4.26\pm 0.097$	$12.35\pm 0.229$	$2.54\pm 0.029$	$3.06\pm 0.017$	$4.86\pm 0.059$	$4.77\pm 0.083$	$0.71\pm 0.017$	$50.28\pm 0.254$
	Ours-2	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.19\pm 0.014$	$12.05\pm 0.133$	$2.58\pm 0.015$	$2.88\pm 0.011$	$4.80\pm 0.017$	$4.71\pm 0.037$	$0.70\pm 0.004$	$49.53\pm 0.128$
	OPT-3	$6.27\pm 0.027$	$11.16\pm 0.002$	$4.04\pm 0.028$	$12.24\pm 0.367$	$2.59\pm 0.006$	$2.92\pm 0.011$	$4.67\pm 0.013$	$4.72\pm 0.019$	$0.71\pm 0.009$	$49.32\pm 0.407$
	Random-3	$6.65\pm 0.030$	$11.16\pm 0.018$	$4.12\pm 0.019$	$12.16\pm 0.147$	$2.61\pm 0.065$	$2.90\pm 0.019$	$4.78\pm 0.018$	$4.71\pm 0.010$	$0.79\pm 0.012$	$49.88\pm 0.128$
	TAG-3	$6.50\pm 0.158$	$11.14\pm 0.047$	$4.11\pm 0.007$	$12.31\pm 0.030$	$2.55\pm 0.029$	$2.92\pm 0.063$	$4.88\pm 0.118$	$4.73\pm 0.013$	$0.76\pm 0.014$	$49.90\pm 0.249$
	HOA-3	$6.55\pm 0.024$	$11.14\pm 0.047$	$4.12\pm 0.023$	$12.31\pm 0.030$	$2.57\pm 0.019$	$2.84\pm 0.002$	$4.69\pm 0.024$	$4.76\pm 0.011$	$0.73\pm 0.012$	$49.72\pm 0.143$
	CS-3	$6.50\pm 0.158$	$11.18\pm 0.008$	$4.10\pm 0.013$	$13.06\pm 0.105$	$2.59\pm 0.026$	$3.06\pm 0.017$	$4.88\pm 0.118$	$4.83\pm 0.013$	$0.74\pm 0.004$	$50.95\pm 0.360$
	Ours-3	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.19\pm 0.014$	$12.05\pm 0.133$	$2.58\pm 0.015$	$2.88\pm 0.011$	$4.80\pm 0.017$	$4.69\pm 0.027$	$0.70\pm 0.004$	$49.52\pm 0.104$
	OPT-4	$6.30\pm 0.036$	$11.11\pm 0.020$	$4.08\pm 0.079$	$11.92\pm 0.012$	$2.68\pm 0.019$	$2.94\pm 0.021$	$4.67\pm 0.013$	$4.77\pm 0.006$	$0.71\pm 0.009$	$49.17\pm 0.009$
	Random-4	$6.40\pm 0.023$	$11.25\pm 0.029$	$4.11\pm 0.010$	$12.92\pm 0.102$	$2.55\pm 0.014$	$3.01\pm 0.002$	$4.74\pm 0.023$	$4.75\pm 0.006$	$0.76\pm 0.013$	$50.49\pm 0.117$
	TAG-4	$6.50\pm 0.158$	$10.97\pm 0.023$	$4.09\pm 0.033$	$12.31\pm 0.030$	$2.62\pm 0.010$	$2.99\pm 0.004$	$4.76\pm 0.009$	$4.63\pm 0.012$	$0.72\pm 0.002$	$49.60\pm 0.128$
	HOA-4	$6.65\pm 0.033$	$11.14\pm 0.047$	$4.11\pm 0.007$	$12.31\pm 0.030$	$2.55\pm 0.029$	$2.92\pm 0.063$	$4.69\pm 0.024$	$4.73\pm 0.013$	$0.76\pm 0.014$	$49.85\pm 0.080$
	CS-4	$6.50\pm 0.158$	$10.97\pm 0.023$	$4.09\pm 0.033$	$12.31\pm 0.030$	$2.59\pm 0.026$	$3.06\pm 0.017$	$4.88\pm 0.118$	$4.63\pm 0.012$	$0.72\pm 0.002$	$49.75\pm 0.233$
	Ours-4	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.09\pm 0.015$	$12.05\pm 0.133$	$2.58\pm 0.015$	$2.88\pm 0.011$	$4.72\pm 0.053$	$4.69\pm 0.027$	$0.70\pm 0.004$	$49.33\pm 0.127$
	Random-5	$6.67\pm 0.016$	$11.20\pm 0.098$	$4.10\pm 0.045$	$11.83\pm 0.193$	$2.61\pm 0.004$	$2.87\pm 0.004$	$5.02\pm 0.043$	$4.81\pm 0.078$	$0.80\pm 0.014$	$49.91\pm 0.362$
	Ours-5	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.09\pm 0.015$	$12.05\pm 0.133$	$2.54\pm 0.029$	$2.88\pm 0.011$	$4.72\pm 0.053$	$4.69\pm 0.027$	$0.70\pm 0.004$	$49.30\pm 0.092$
	Random-6	$6.49\pm 0.022$	$11.23\pm 0.029$	$4.10\pm 0.018$	$11.79\pm 0.023$	$2.57\pm 0.017$	$2.95\pm 0.008$	$4.83\pm 0.019$	$4.71\pm 0.030$	$0.77\pm 0.009$	$49.45\pm 0.028$
	Ours-6	$6.55\pm 0.016$	$11.09\pm 0.023$	$4.09\pm 0.015$	$12.05\pm 0.133$	$2.54\pm 0.029$	$2.88\pm 0.011$	$4.72\pm 0.053$	$4.69\pm 0.027$	$0.70\pm 0.004$	$49.30\pm 0.092$

C.2 Time Series Tasks

For the time series tasks, we employ the Autoformer architecture as the neural network. Time series forecasting encompasses two widely recognized settings: multivariate and univariate. Since the ETTm1 dataset comprises seven time series, it is applicable under both. To set up a task grouping experiment, we configure it as several univariate prediction tasks, adapting the Autoformer model so that most of its components are shared across tasks while each task receives its own decoder for making predictions. For the detailed hyperparameter settings of the model’s structure and training, we adhere to the configurations specified by (Song et al., 2022).

Table 4 reveals that our approach surpasses other task grouping strategies across all splits. Nevertheless, it falls marginally short of the top MTL method, PCGrad, in the 2-split and 3-split scenarios. As the number of splits increases, our method matches or outperforms all MTL techniques in terms of overall Mean Absolute Error (MAE). Furthermore, the results for TAG and Optimal beyond the 4-split scenario were not obtained because the computation is intractable.

Table 4: Comparison among our proposed method, single task training (STL) and multi-task learning (MTL) methods on the ETTm1 dataset. The table highlights the following representations: STL’s strong baseline in orange, the optimal baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the optimal baseline) in bold.

	Method	a1	a2	a3	a4	a5	a6	a7	Total MAE
	STL	$0.64\pm 0.016$	$0.37\pm 0.009$	$0.68\pm 0.015$	$0.36\pm 0.011$	$0.56\pm 0.027$	$0.29\pm 0.002$	$0.15\pm 0.006$	$3.05\pm 0.030$
MTL	Bandit-MTL	$0.67\pm 0.041$	$0.37\pm 0.010$	$0.68\pm 0.051$	$0.34\pm 0.003$	$0.58\pm 0.017$	$0.26\pm 0.004$	$0.14\pm 0.003$	$3.05\pm 0.104$
	LS	$0.67\pm 0.014$	$0.38\pm 0.004$	$0.67\pm 0.040$	$0.36\pm 0.007$	$0.58\pm 0.023$	$0.28\pm 0.003$	$0.15\pm 0.002$	$3.09\pm 0.034$
	UW	$0.67\pm 0.011$	$0.38\pm 0.014$	$0.62\pm 0.011$	$0.36\pm 0.013$	$0.59\pm 0.021$	$0.27\pm 0.011$	$0.16\pm 0.013$	$3.06\pm 0.079$
	Nash-MTL	$0.61\pm 0.017$	$0.38\pm 0.011$	$0.65\pm 0.006$	$0.36\pm 0.004$	$0.59\pm 0.037$	$0.27\pm 0.003$	$0.15\pm 0.013$	$3.00\pm 0.059$
	PCGrad	$0.65\pm 0.033$	$0.38\pm 0.011$	$0.62\pm 0.020$	$0.35\pm 0.001$	$0.57\pm 0.006$	$0.27\pm 0.011$	$0.15\pm 0.001$	$2.99\pm 0.066$
	Naive-MTL	$0.66\pm 0.014$	$0.38\pm 0.006$	$0.64\pm 0.056$	$0.35\pm 0.009$	$0.59\pm 0.016$	$0.28\pm 0.005$	$0.15\pm 0.009$	$3.03\pm 0.101$
Task Grouping	Optimal-2	$0.61\pm 0.015$	$0.37\pm 0.007$	$0.62\pm 0.036$	$0.35\pm 0.003$	$0.56\pm 0.029$	$0.28\pm 0.008$	$0.15\pm 0.011$	$2.93\pm 0.064$
	TAG-2	$0.66\pm 0.014$	$0.37\pm 0.005$	$0.64\pm 0.056$	$0.35\pm 0.009$	$0.59\pm 0.016$	$0.28\pm 0.005$	$0.15\pm 0.009$	$3.03\pm 0.100$
	Random-2	$0.66\pm 0.014$	$0.38\pm 0.006$	$0.64\pm 0.056$	$0.35\pm 0.009$	$0.59\pm 0.016$	$0.28\pm 0.005$	$0.15\pm 0.009$	$3.03\pm 0.101$
	Ours-2	$0.66\pm 0.014$	$0.38\pm 0.006$	$0.64\pm 0.056$	$0.35\pm 0.009$	$0.57\pm 0.017$	$0.27\pm 0.002$	$0.15\pm 0.009$	$3.01\pm 0.096$
	Optimal-3	$0.61\pm 0.015$	$0.37\pm 0.007$	$0.62\pm 0.036$	$0.35\pm 0.003$	$0.55\pm 0.022$	$0.27\pm 0.015$	$0.15\pm 0.011$	$2.91\pm 0.057$
	TAG-3	$0.63\pm 0.030$	$0.38\pm 0.008$	$0.65\pm 0.017$	$0.36\pm 0.007$	$0.57\pm 0.022$	$0.29\pm 0.016$	$0.16\pm 0.001$	$3.04\pm 0.063$
	Random-3	$0.67\pm 0.025$	$0.39\pm 0.004$	$0.65\pm 0.031$	$0.35\pm 0.004$	$0.57\pm 0.020$	$0.28\pm 0.009$	$0.14\pm 0.001$	$3.05\pm 0.060$
	Ours-3	$0.66\pm 0.014$	$0.38\pm 0.006$	$0.64\pm 0.056$	$0.35\pm 0.009$	$0.56\pm 0.011$	$0.27\pm 0.002$	$0.15\pm 0.009$	$3.00\pm 0.079$
	Optimal-4	$0.61\pm 0.015$	$0.37\pm 0.007$	$0.62\pm 0.036$	$0.35\pm 0.003$	$0.55\pm 0.022$	$0.27\pm 0.015$	$0.14\pm 0.001$	$2.90\pm 0.051$
	TAG-4	$0.63\pm 0.004$	$0.38\pm 0.008$	$0.63\pm 0.030$	$0.36\pm 0.013$	$0.58\pm 0.025$	$0.28\pm 0.008$	$0.16\pm 0.007$	$3.03\pm 0.053$
	Random-4	$0.64\pm 0.026$	$0.38\pm 0.005$	$0.68\pm 0.043$	$0.35\pm 0.005$	$0.55\pm 0.022$	$0.27\pm 0.005$	$0.15\pm 0.011$	$3.02\pm 0.054$
	Ours-4	$0.63\pm 0.030$	$0.38\pm 0.007$	$0.65\pm 0.055$	$0.36\pm 0.004$	$0.56\pm 0.011$	$0.27\pm 0.002$	$0.14\pm 0.005$	$2.99\pm 0.098$
	Random-5	$0.65\pm 0.032$	$0.37\pm 0.005$	$0.65\pm 0.028$	$0.36\pm 0.007$	$0.57\pm 0.019$	$0.27\pm 0.005$	$0.14\pm 0.012$	$3.02\pm 0.063$
	Ours-5	$0.63\pm 0.030$	$0.38\pm 0.007$	$0.65\pm 0.055$	$0.36\pm 0.004$	$0.56\pm 0.011$	$0.27\pm 0.002$	$0.14\pm 0.005$	$2.99\pm 0.098$
	Random-6	$0.64\pm 0.026$	$0.38\pm 0.007$	$0.68\pm 0.015$	$0.35\pm 0.005$	$0.56\pm 0.027$	$0.28\pm 0.004$	$0.15\pm 0.007$	$3.03\pm 0.056$
	Ours-6	$0.63\pm 0.030$	$0.38\pm 0.007$	$0.65\pm 0.055$	$0.36\pm 0.004$	$0.56\pm 0.011$	$0.27\pm 0.002$	$0.14\pm 0.005$	$2.99\pm 0.098$
	Ours-7	$0.63\pm 0.030$	$0.38\pm 0.007$	$0.65\pm 0.055$	$0.36\pm 0.004$	$0.56\pm 0.011$	$0.27\pm 0.002$	$0.14\pm 0.005$	$2.99\pm 0.098$

C.3 Visualized Transfer Gains

The pairwise transfer gains collected by our method are visualized in Figure 4. The heatmap reveals that nearly all tasks achieve the highest transfer gain with themselves, while exhibiting varying transfer gain distributions across other tasks. This variation indicates an opportunity to improve overall performance through task grouping. Moreover, certain subsets within the heatmap show high transfer gains among all their tasks, suggesting that these tasks naturally form an effective group. It is also noteworthy that the heatmap for ETTm1 (Figure 4(c)) is less structured than those for the other two benchmarks. This lack of structure makes it harder to generate effective grouping results and thus presents a more difficult scenario for task grouping.

Appendix D Analysis of Collection Strategy on Transfer Gain

This section presents a comprehensive analysis of the effects of transfer gain collection frequency and the number of epochs required for our method on the final performance metrics.

Table 5 assesses the impact of the transfer gain collection frequency on relative improvement and speedup for two datasets, CelebA and ETTm1, with collection intervals ranging from 1 to 200 steps. On CelebA (Table 5(a)), the relative improvement over a random policy initially increases as collection becomes less frequent, up to an interval of 50 steps, potentially because sparser collection reduces the risk of over-estimation and thus improves task grouping performance. Beyond an interval of 50, however, performance diminishes because transfer gains are estimated inaccurately from insufficient data. A similar trend is observed on ETTm1 (Table 5(b)), where the improvement peaks at an interval of 5 steps and declines gradually afterwards. This implies that a lazy collection strategy with intervals of 10 to 50 steps yields a speedup of roughly 7x to 25x while maintaining, and potentially even improving, performance.

(a) CelebA

Steps	Relative Improvement $(\uparrow)$	Relative Speedup $(\uparrow)$
1	3.50%	1.00
5	3.53%	4.25
10	3.94%	7.18
25	4.60%	12.16
50	5.13%	15.95
100	1.83%	15.95
200	3.06%	18.62

(b) ETTm1

Steps	Relative Improvement $(\uparrow)$	Relative Speedup $(\uparrow)$
1	5.76%	1.00
5	9.48%	4.62
10	9.20%	8.45
25	8.75%	16.64
50	8.75%	24.94
100	6.70%	33.24
200	5.87%	39.16

Table 5: Variation in total test errors across different frequencies of transfer gain collection, compared to test errors from random grou**s, with speedup evaluated based on the computation of transfer gains at each step. The ’Steps’ mentioned in the table correspond to the intervals between successive transfer gain collection steps.
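The lazy collection strategy can be sketched as a training loop that evaluates the transfer gain only every `interval` steps and averages the collected values. The averaging rule, the function names, and the stand-in `fake_gain` are illustrative assumptions; in practice the callback would be the actual, expensive transfer gain computation.

```python
def collect_lazily(n_steps, interval, compute_gain):
    """Average a per-step gain estimate that is evaluated every `interval` steps."""
    total, count = 0.0, 0
    for step in range(n_steps):
        # ... one ordinary training step would run here ...
        if step % interval == 0:
            total += compute_gain(step)
            count += 1
    return total / count, count


# Hypothetical stand-in for the transfer gain computation.
def fake_gain(step):
    return 1.0 / (1 + step)


for interval in (1, 5, 10, 50):
    avg, n_evals = collect_lazily(1000, interval, fake_gain)
    print(interval, n_evals, round(avg, 4))
```

The number of gain evaluations shrinks by a factor equal to the interval, which bounds the achievable speedup from above. The speedups in Table 5 stay below the corresponding interval, presumably because part of the cost does not shrink with the interval.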

Figure 5 shows the performance trends across different numbers of epochs for the CelebA and ETTm1 datasets. For CelebA, the total error decreases markedly as the number of epochs increases and stabilizes after approximately 20 epochs. This stabilization suggests that the model benefits quickly from the initial training iterations but then reaches a plateau, with little to no gain from additional training beyond this point. For ETTm1, the initial drop in total Mean Absolute Error (MAE) is steep, suggesting significant learning gains from early training. The MAE curve then flattens out after about 6 epochs, which implies that further training yields diminishing improvements in model performance.

These findings demonstrate that transfer gains do not accumulate uniformly throughout training; certain training phases yield more substantial improvements than others. Drawing on these insights, we can deduce two empirical guidelines for collecting transfer gains efficiently: (1) the initial training period is crucial for uncovering task relationships, so resources should be concentrated on the early stages of training; (2) a collection interval of 5 to 50 steps works well, as the gains from consecutive steps are likely to be similar.
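The lazy collection strategy described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `collect_lazily`, `train_step` and `measure_gains` are hypothetical names, and the callback returning a dictionary of pairwise gains is an assumption made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: maps a (source task, target task) pair to a transfer gain.
Gains = Dict[Tuple[int, int], float]


def collect_lazily(
    num_steps: int,
    interval: int,
    train_step: Callable[[int], None],
    measure_gains: Callable[[int], Gains],
) -> Tuple[Gains, int]:
    """Train for `num_steps`, measuring transfer gains every `interval` steps.

    Returns the per-pair gains averaged over all measurements and the
    number of measurements taken.
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")

    totals: Gains = {}
    num_measurements = 0
    for step in range(num_steps):
        train_step(step)  # ordinary training update (assumed callback)
        if step % interval == 0:
            # Expensive part: only executed on every `interval`-th step.
            for pair, gain in measure_gains(step).items():
                totals[pair] = totals.get(pair, 0.0) + gain
            num_measurements += 1

    averaged = {pair: total / num_measurements for pair, total in totals.items()}
    return averaged, num_measurements
```

With `interval=10`, the gain computation runs on one tenth of the steps. The wall-clock speedups in Table 5 are smaller than the interval itself, which is consistent with the training updates still running at every step.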

		$\displaystyle\mathcal{S}_{A\rightarrow j}^{t}=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{A\cup\{j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\sum_{i\in A}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle=\frac{\sum_{i\in A}\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{\{i,j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\frac{\sum_{i\in A}\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)^{T}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle\leq\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\eta_{t}\frac{\sum_{i\in A}\left\|\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)\right\|\cdot\left\|\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})\right\|}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}$
		$\displaystyle\leq\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\frac{\eta_{t}(1+|A|)L^{2}}{C}$
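
The final step of the derivation above can be unpacked as a short sketch. It assumes, as the symbols in the bound suggest, that the constant $L$ is an upper bound on the norm of every task gradient with respect to $\phi$ and that $C>0$ is a lower bound on the loss in the denominator; these readings of $L$ and $C$ are assumptions made here for illustration. The inner product in the residual term is at most the product of the two norms (Cauchy-Schwarz), and the triangle inequality bounds the first norm:

		$\displaystyle\left\|\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)\right\|\leq\left(1+\frac{1}{|A|}\right)L$

Summing over the $|A|$ tasks in $A$, with each remaining gradient norm bounded by $L$, then gives

		$\displaystyle\eta_{t}\frac{\sum_{i\in A}\left(1+\frac{1}{|A|}\right)L\cdot L}{C}=\frac{\eta_{t}(1+|A|)L^{2}}{C}$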