License: arXiv.org perpetual non-exclusive license
arXiv:2402.15328v1 [cs.LG] 23 Feb 2024

Towards Principled Task Grouping for Multi-Task Learning

Chenguang Wang    Xuanhao Pan    Tianshu Yu
Abstract

This paper presents a novel approach to task grouping in Multitask Learning (MTL), advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our approach offers a more theoretically grounded method that does not rely on restrictive assumptions for constructing transfer gains. We also propose a flexible mathematical programming formulation which can accommodate a wide spectrum of resource constraints, thus enhancing its versatility. Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks and time series tasks, demonstrate the superiority of our method over extensive baselines, validating its effectiveness and general applicability in MTL.

Machine Learning, ICML

1 Introduction

Multitask Learning (MTL) (Caruana, 1997; Zhang & Yang, 2021; Vandenhende et al., 2021) represents a forefront area in machine learning, aiming to improve learning efficiency and prediction accuracy by leveraging commonalities and differences across multiple tasks, reflected by the so-called inter-task “transfer gain”. Building upon this foundational concept, MTL has exhibited exceptional performance across a spectrum of domains, including computer vision (Standley et al., 2020; Fifty et al., 2021; Song et al., 2022; Sherif et al., 2023), NLP (Zhang et al., 2022b; Ding et al., 2023), Neural Architecture Search (Guo et al., 2020; Zhang et al., 2022a; Raychaudhuri et al., 2022; Yue et al., 2023), speech recognition (Zhang et al., 2019b; Huang et al., 2022) and combinatorial optimization problems (Wang & Yu, 2023; Wang et al., 2024). Central to the optimization of this framework is the concept of task grouping. Task grouping (Kang et al., 2011; Kumar & Daume III, 2012; Lee et al., 2016, 2018; Zamir et al., 2018; Dwivedi & Roig, 2019; Malhotra et al., 2022; Standley et al., 2020; Fifty et al., 2021; Song et al., 2022) in MTL involves strategically dividing a set of tasks into several groups, where each group encapsulates tasks that share maximal positive transfer while minimizing negative transfer.

Recent studies such as those by Standley et al. (2020) and Fifty et al. (2021) have contributed significantly to this domain. Both works collect specific task affinities in a single run of MTL training, which are then used to group tasks under a high-order approximation assumption on the task relationships. Subsequently, these groups are trained separately using MTL methods. However, these approaches exhibit key limitations. Firstly, their task affinity measures lack a theoretical guarantee, raising concerns about the reliability and predictability of the task grouping effectiveness. Secondly, they rely on an enumeration-based branch and bound algorithm for solving the task grouping problem. This approach not only sacrifices efficiency in terms of computational resources but also poses challenges in incorporating additional constraints, limiting its practical applicability in more complex and realistic scenarios.

In this work, we introduce a novel approach to task grouping in MTL that addresses these limitations and offers significant advancements over existing methodologies. First, we propose a theoretically grounded method for constructing transfer gains. Different from TAG (Fifty et al., 2021), which makes restrictive assumptions such as convexity and smoothness of the loss functions, the proposed transfer gain is derived without such conditions. Furthermore, it keeps the computational complexity at the same order as TAG by following the high-order approximation assumption regarding task relationships, as utilized in prior work (Standley et al., 2020; Fifty et al., 2021). Second, our work introduces a generic and flexible mathematical programming formulation to solve task grouping problems. This formulation can readily incorporate a variety of budget constraints, a critical aspect in real-world applications. By doing so, our method ensures the practicality and adaptability of MTL models in diverse scenarios, ranging from computational budget allocation to resource utilization considerations.

Our experimental evaluations across diverse domains, including computer vision datasets like CelebA, combinatorial optimization benchmarks and time series datasets, demonstrate the validity and generality of our proposed task grouping strategy in three respects. First, we establish that our method consistently outperforms a wide range of baselines, encompassing single-task learning, multi-task learning, and various task grouping methods. This substantiates its effectiveness across these three diverse domains. Second, we illustrate the flexibility and effectiveness of our proposed mathematical programming formulation by introducing various constraints, mirroring real-world scenarios where resource budgets, such as GPU memory limitations and resource utilization, come into play. Our results demonstrate that our method significantly outperforms the baseline methods, showcasing its adaptability and performance improvement under such constraints. Finally, we provide a comprehensive ablation study, highlighting the superiority of our proposed transfer gain and the efficiency and effectiveness of our mathematical programming formulation compared with TAG (Fifty et al., 2021).

In summary, this work makes several key contributions in the realm of task grouping:

  • We propose a theoretically principled approach to constructing transfer gains without relying on restrictive assumptions;

  • To solve task grouping problems, we introduce a generic and flexible mathematical programming formulation, capable of seamlessly integrating various budget constraints;

  • Through extensive experiments, we demonstrate the effectiveness of our task grouping strategy across diverse domains. Furthermore, we empirically showcase the flexibility of our mathematical programming approach by addressing realistic constraints.

2 Related Works

Task Grouping. Task grouping in multitask learning involves organizing tasks into groups based on their inherent relationships or similarities. The idea is to exploit shared knowledge within each group to improve overall learning efficiency. Most early works utilized normalization terms to partition the model parameters to align with the grouping of tasks (Kang et al., 2011; Kumar & Daume III, 2012; Lee et al., 2016). Lee et al. (2018) extend the setting of Lee et al. (2016) to deep learning, aiming to model asymmetric task relationships with an autoencoder. Zamir et al. (2018) presented “Taskonomy”, which disentangles task relationships and provides a more refined methodology for task grouping based on transfer learning hierarchies. Dwivedi & Roig (2019) introduced an approach to task taxonomy in multitask learning through representation similarity analysis, showing effectiveness and efficiency for generating task relationships on the Taskonomy dataset. Malhotra et al. (2022) contributed to the task grouping discourse by introducing scheduled task mitigation, which dynamically sequences task learning for improved multitask performance. Closely related to our work, Standley et al. (2020) and Fifty et al. (2021) apply a two-stage methodology to solve task grouping problems: first collecting the training information and defining task affinities, then using a Branch and Bound algorithm to search for the best grouping. More recently, Song et al. (2022) addressed task grouping through a meta-learning approach, which creates a meta-learner to estimate task grouping gains. Sherif et al. (2023) proposed a data-driven method based on Data Map (Swayamdipta et al., 2020) to capture task similarity on the fly, which can relieve the scalability issue.

Lookahead Methods. The philosophy of Lookahead methods is using future information to guide the current state, which has been widely used in meta-learning (Finn et al., 2017; Nichol et al., 2018; Wang et al., 2020b), multitask learning (Fifty et al., 2021) and optimization techniques (Zhang et al., 2019a; Wang et al., 2020a; Zhou et al., 2021; Byun et al., 2022). In particular for multitask learning, Fifty et al. (2021) collected the one-step-forward loss information between task pairs for each gradient updating and constructed the overall task affinity matrix at the end of training. We follow this idea but construct a more principled lookahead-based metric.

Loss Balance. Numerous works have emerged to address multitask learning by exploring the balance of the losses from different tasks (Mao et al., 2021; Yu et al., 2020; Javaloy & Valera, 2022; Navon et al., 2022; Kendall et al., 2018; Liu et al., 2021a, b; Guangyuan et al., 2022; Liu et al., 2022). In these works, various loss reweighting mechanisms are designed to dynamically balance the importance of each task, which can relieve the negative transfer among tasks in terms of gradient information. Being plug-and-play techniques, loss balance methods can be flexibly applied to existing multitask learning frameworks.

3 Preliminary

Definition 3.1.

(Multitask Learning) Consider a set of tasks $\mathcal{T}=\{T_i, i\in[n]\}$, where $[n]=\{1,2,\ldots,n\}$ and each task $T_i$ is associated with a learning objective $L_i(\phi,\theta_i)$, where $\phi\in\mathbb{R}^{p}$ represents the shared parameters across all tasks, and $\theta_i\in\mathbb{R}^{p_i}$ denotes the task-specific parameters for task $T_i$. The objective is to jointly optimize a composite loss function that combines the individual task losses, allowing the model to learn salient information across tasks. The multitask objective can be formulated as:

$$\phi^{*},\theta_1^{*},\ldots,\theta_n^{*}=\arg\min_{\phi,\theta_1,\ldots,\theta_n}\sum_{i=1}^{n}\lambda_i L_i(\phi,\theta_i)$$

where $\lambda_i$ is a task-specific weighting parameter.

In contrast to the conventional approach of obtaining optimal parameters directly from the MTL objective, this work leverages the relationships among tasks observed during the training process. Rather than optimizing all task parameters in a single joint problem, it capitalizes on the insights derived from these task relationships. This leads to the concept of “Task Grouping”, where tasks are organized based on their observed relationships.

Definition 3.2.

(Task Grouping) Let $\mathcal{T}=\{T_i, i\in[n]\}$ denote the set of tasks, and $G=\{G_j, j\in[m]\}$ represent the set of task groups. Task grouping endeavors to establish a mapping based on task relationships such that for every task $T_i$, there exists a group $G_j$ to which $T_i$ is assigned, ensuring the inclusion of at least one task in each group and resulting in the best performance for each task.

The subsequent step involves obtaining optimal parameters at the level of task groups. The objective is:

$$\min_{\phi,\{\theta_i\}}\sum_{i\in G_j}\lambda_i L_i(\phi,\theta_i)\quad \forall G_j\in G, \tag{1}$$

to enhance efficiency and effectiveness by jointly optimizing parameters within these groups.
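As a minimal sketch of the weighted objectives above, the following Python snippet evaluates $\sum_{i\in G_j}\lambda_i L_i(\phi,\theta_i)$ for one chosen group; taking the group to be all tasks recovers the full multitask objective. The quadratic toy losses, the function name `group_objective`, and all numeric values are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, Iterable, Sequence

# A task loss maps (shared parameter phi, task-specific parameter theta_i) to a scalar.
Loss = Callable[[float, float], float]


def group_objective(
    losses: Sequence[Loss],
    weights: Sequence[float],
    phi: float,
    thetas: Sequence[float],
    group: Iterable[int],
) -> float:
    """Weighted sum of the losses of the tasks in `group` (Equation (1) for one group)."""
    return sum(weights[i] * losses[i](phi, thetas[i]) for i in group)


# Hypothetical toy losses with scalar parameters, for illustration only.
toy_losses = [
    lambda phi, theta: (phi + theta - 1.0) ** 2,
    lambda phi, theta: (phi + theta + 1.0) ** 2,
    lambda phi, theta: (phi * theta - 2.0) ** 2,
]
toy_weights = [1.0, 0.5, 2.0]
toy_thetas = [0.0, 0.0, 1.0]

# Full multitask objective (all three tasks) versus the objective of the group {0, 1}.
full_mtl = group_objective(toy_losses, toy_weights, 0.0, toy_thetas, range(3))
group_01 = group_objective(toy_losses, toy_weights, 0.0, toy_thetas, [0, 1])
```

In an actual training loop each group would own its own copy of the shared parameters $\phi$; the sketch only evaluates the objective at fixed parameter values.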

4 Method

To infer task groupings for the subsequent optimization in Equation (1), we introduce a methodology for constructing transfer gains in Section 4.1, whose theoretical properties do not rely on restrictive assumptions. Subsequently, we propose a versatile mathematical programming framework in Section 4.2 that flexibly accommodates various budget constraints. This formulation is instrumental in deriving the task grouping results. Furthermore, in Section 4.3, we conduct a detailed analysis of the computational complexity associated with collecting transfer gains, in comparison to TAG (Fifty et al., 2021).

4.1 Assumption-Free Transfer Gain

In this subsection, we first introduce the concept of proposed transfer gain, which is a pivotal component of our methodology.

Definition 4.1.

(Transfer Gain) For task $T_i\neq T_j$, the task transfer gain from $T_i$ to $T_j$ at training step $t$ is characterized by

$$\mathcal{S}_{i\rightarrow j}^{t}=1-\frac{L_j\left(\phi_{\{i,j\}}^{t+1},\theta_j^{t+1}\right)}{L_j\left(\phi_{\{j\}}^{t+1},\theta_j^{t+1}\right)} \tag{2}$$

In this equation, $L_j$ represents a task-specific metric, which can be, for example, the loss function or validation accuracy. $\phi_{\{i,j\}}^{t+1}$ and $\theta_j^{t+1}$ denote the shared and task-specific parameters obtained after one further training iteration on $T_i$ and $T_j$, while $\phi_{\{j\}}^{t+1}$ is the counterpart obtained by training on $T_j$ alone. Then we can define the group transfer gain from any $A\subseteq\mathcal{T}$ to task $T_j$ by

$$\mathcal{S}_{A\rightarrow j}^{t}=1-\frac{L_j\left(\phi_{A\cup\{j\}}^{t+1},\theta_j^{t+1}\right)}{L_j\left(\phi_{\{j\}}^{t+1},\theta_j^{t+1}\right)} \tag{3}$$

Furthermore, we extend this concept to the group transfer gain from any $A\subseteq\mathcal{T}$ to $B\subseteq\mathcal{T}$ by

$$\mathcal{S}_{A\rightarrow B}^{t}=\sum_{j\in B}\mathcal{S}_{A\rightarrow j}^{t}, \tag{4}$$

which allows us to measure the collective transfer knowledge between groups of tasks.
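The definitions in Equations (2) to (4) can be illustrated with a small Python sketch. It assumes a toy setting that is not from the paper: every task $k$ has the loss $L_k(\phi)=(\phi-c_k)^2+1$ on a single shared scalar $\phi$, task-specific parameters are omitted, and one plain gradient step with a hypothetical learning rate `ETA` plays the role of the lookahead update. All function names are illustrative.

```python
from typing import Dict, Iterable

ETA = 0.1  # hypothetical learning rate

# Hypothetical toy tasks: L_k(phi) = (phi - c_k)^2 + 1 with a single shared scalar phi.
# The "+ 1" keeps every loss bounded away from zero.
TARGETS: Dict[int, float] = {0: 1.0, 1: 1.2, 2: -1.0}


def loss(k: int, phi: float) -> float:
    return (phi - TARGETS[k]) ** 2 + 1.0


def grad(k: int, phi: float) -> float:
    return 2.0 * (phi - TARGETS[k])


def lookahead(phi: float, tasks: Iterable[int]) -> float:
    """One gradient step on the summed loss of `tasks`, starting from phi."""
    return phi - ETA * sum(grad(k, phi) for k in set(tasks))


def gain_to_task(phi: float, source: Iterable[int], j: int) -> float:
    """Equations (2)/(3): 1 - L_j(phi after training source and j) / L_j(phi after training j alone)."""
    joint = lookahead(phi, set(source) | {j})
    alone = lookahead(phi, {j})
    return 1.0 - loss(j, joint) / loss(j, alone)


def gain_to_group(phi: float, source: Iterable[int], targets: Iterable[int]) -> float:
    """Equation (4): sum of the gains from `source` to each task in `targets`."""
    source_set = set(source)
    return sum(gain_to_task(phi, source_set, j) for j in targets)


helpful = gain_to_task(0.0, [1], 0)  # task 1 pulls phi in the same direction as task 0
harmful = gain_to_task(0.0, [2], 0)  # task 2 pulls phi in the opposite direction
```

In this toy setting `helpful` is positive and `harmful` is negative, matching the reading that a positive gain means joint training lowers $L_j$ more than training $T_j$ alone.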

While our formulation in Equation (2) may bear a resemblance to prior work by Fifty et al. (2021), we demonstrate its advantages in the following discussion. First, Fifty et al. (2021) define the task affinity as $\mathcal{Z}_{i\rightarrow j}^{t}=1-\frac{L_j\left(\theta_{s\mid i}^{t+1},\theta_j^{t}\right)}{L_j\left(\phi^{t},\theta_j^{t}\right)}$ w.r.t. the loss function (in their notation, $\theta_{s\mid i}^{t+1}$ denotes the shared parameters after an update on $T_i$), reflecting the effects of training $T_i$ on $T_j$, while $\mathcal{S}^{t}_{i\rightarrow j}$ measures the effects of training $T_i$ on training $T_j$. This distinction allows us to establish the relationship between loss decrease and the value of $\mathcal{S}^{t}_{i\rightarrow j}$ without additional assumptions. Specifically, if $\mathcal{S}^{t}_{i\rightarrow j}>\mathcal{S}^{t}_{k\rightarrow j}$, then it is naturally true that training $\{T_i,T_j\}$ results in more loss decrease than training $\{T_k,T_j\}$, as further summarized in the following observation:

Observation 4.2.

If $\mathcal{S}_{A_1\rightarrow i}^{t}>\mathcal{S}_{A_2\rightarrow i}^{t}$, then training task group $A_1$ induces larger loss decrease than $A_2$ for task $i$.

In contrast, Fifty et al. (2021) introduce restrictive assumptions, such as strong convexity of the loss functions, to enforce this relationship. Secondly, in Fifty et al. (2021), group transfer gain is formulated as $\mathcal{Z}_{\{j,k\}\rightarrow i}^{t}=\frac{1}{2}\left(\mathcal{Z}_{j\rightarrow i}^{t}+\mathcal{Z}_{k\rightarrow i}^{t}\right)$. This formulation directly defines the group transfer gain at the task level, without a theoretical guarantee of its effectiveness. In this work, we establish the connection between task and group transfer gain in Proposition 4.3, leading to both theoretical advantages and valid empirical operations at the implementation level.

Proposition 4.3.

Assume $\forall i$, $L_i$ is an L-Lipschitz loss function and $0<C\leq L_j$, then we have:

$$\left|\mathcal{S}_{A\rightarrow j}^{t}-\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}\right|\leq\frac{\eta_t(1+|A|)L^{2}}{C}$$

where $\eta_t$ is the learning rate at time step $t$.

We leave the proof in Appendix A.

Remark 4.4.

In practical scenarios, the magnitude of the task-specific loss function $L_j$ resides within a reasonable range sufficiently distant from zero. This can be achieved, for instance, through the incorporation of regularization techniques such as regularization terms or early stopping mechanisms. Specifically, the condition $0<C<L_j,\ \forall j$ can be met through these regularization strategies. Additionally, the learning rate parameter $\eta_t$ is often set to small and gradually decreasing values during the training process. Furthermore, it is worth noting that an upper bound on the group size $|A|$ can usually be constrained to a reasonable magnitude. This limitation is often dictated by memory constraints or other practical considerations, and it can be bounded by a constant. Such constraints ensure that the group size remains within manageable limits, aligning with available resources and memory budgets. Consequently, it is reasonable to posit that the group transfer gain exhibits such an approximate characteristic.

Based on these facts, $\mathcal{S}_{i\rightarrow j}^{t}$ can serve as a measure of the intrinsic signals that reveal the relationship between tasks during training. Furthermore, we introduce the concepts of cumulative transfer gain and cumulative group transfer gain across the training as follows:

$$\mathcal{S}_{i\rightarrow j}=\frac{1}{T}\sum_{t=1}^{T}\mathcal{S}_{i\rightarrow j}^{t},\quad \mathcal{S}_{A\rightarrow j}=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j} \tag{5}$$

to quantify the effects of training group $A$ on task $j$ across the training.
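Equation (5) amounts to two averages, which the following Python sketch makes explicit. The per-step gain matrices and the helper names `cumulative_gain` and `group_gain` are hypothetical and serve only as an illustration; entry `[i][j]` of each matrix stands for the gain from task $i$ to task $j$.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def cumulative_gain(per_step: Sequence[Matrix]) -> Matrix:
    """First part of Equation (5): average the per-step gains S^t over t = 1..T."""
    T = len(per_step)
    n = len(per_step[0])
    return [[sum(S[i][j] for S in per_step) / T for j in range(n)] for i in range(n)]


def group_gain(S: Matrix, A: Sequence[int], j: int) -> float:
    """Second part of Equation (5): mean pairwise gain from the tasks in A to task j."""
    return sum(S[i][j] for i in A) / len(A)


# Hypothetical per-step gains for three tasks over two training steps.
steps = [
    [[0.0, 0.2, -0.1], [0.4, 0.0, 0.0], [-0.2, 0.1, 0.0]],
    [[0.0, 0.4, -0.3], [0.2, 0.0, 0.2], [0.0, 0.3, 0.0]],
]
S = cumulative_gain(steps)
gain_01_to_2 = group_gain(S, [0, 1], 2)  # effect of training group {0, 1} on task 2
```

Only pairwise gains need to be stored, which is what keeps the collection cost independent of the number of candidate groups.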

4.2 Generic and Flexible Task Grouping Framework

After obtaining the cumulative transfer gain across the training from Equation (5), the remaining question is how to accurately determine the task grouping results. Practical applications often require customization based on various requirements, such as resource constraints. Therefore, it is crucial to formulate a general and flexible task grouping framework. In contrast to previous approaches (Zamir et al., 2018; Fifty et al., 2021; Standley et al., 2020) that utilize Binary Integer Programming and Branch & Bound methods, we propose a mathematical programming framework which can be transformed into a Mixed Integer Programming formulation. This involves setting binary variables $X_{ij}$, where $i\in[n]$ and $j\in[m]$, with $X_{ij}=1$ indicating the assignment of task $i$ to group $j$, and $X_{ij}=0$ otherwise. Notably, $X_{\cdot j}\in\mathbb{R}^{n}$ indicates the $j$-th column of $X$, $\mathbf{1}$ represents a vector composed entirely of ones with an adaptable dimension, element $B_{ij}$ in $B\in\mathbb{R}^{n\times m}$ denotes the budget of $T_i$ assigned to $G_j$, $\mathbf{b}\in\mathbb{R}^{m}$ represents the maximum budget for each group, and “$\odot$” signifies the element-wise product between matrices, resulting in the following formulation:

$$
\begin{aligned}
\max_{X}\quad & \sum_{j=1}^{m}\frac{1}{\mathbf{1}^{T}X_{\cdot j}}X_{\cdot j}^{T}\mathcal{S}X_{\cdot j} \\
\text{s.t.}\quad & X^{T}\mathbf{1}\geq\mathbf{1} \\
& X\mathbf{1}\geq\mathbf{1} \\
& (B\odot X)^{T}\mathbf{1}\leq\mathbf{b} \\
& \|X_{\cdot j_1}-X_{\cdot j_2}\|^{2}\geq 1,\quad j_1\neq j_2 \\
& X\in\{0,1\}^{n\times m}
\end{aligned}
\tag{6}
$$

The objective of the task grouping problem aggregates the impact of training each individual group. The quadratic form $X_{\cdot j}^{T}\mathcal{S}X_{\cdot j}$ arises from the approximation stated in Proposition 4.3. Several key constraints are imposed to ensure a meaningful solution. The constraint $X^{T}\mathbf{1}\geq\mathbf{1}$ guarantees that each group contains at least one task, while the constraint $X\mathbf{1}\geq\mathbf{1}$ ensures that every task is included in the grouping. Additionally, diverse budgetary constraints can be introduced through the inequality $(B\odot X)^{T}\mathbf{1}\leq\mathbf{b}$. Moreover, the resulting groups must remain distinct, which is ensured by the condition $\|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1$ for $j_{1}\neq j_{2}$. Recent advances in mathematical programming solvers allow these problems to be solved efficiently after appropriate transformations, as described in Appendix B.
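As an illustration of Formulation (6), the following minimal sketch evaluates the objective and checks the constraints for a given binary assignment matrix. It is not the authors' solver: the function names, the toy gain matrix `S`, and the choice of an $n\times m$ matrix `X` (tasks by groups) are assumptions made for this example only.

```python
import numpy as np

def grouping_objective(X, S):
    """Sum over groups j of (X_j^T S X_j) / (1^T X_j) for a binary matrix X."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    sizes = X.sum(axis=0)                      # 1^T X_{.j}: tasks per group
    quad = np.einsum("ij,ik,kj->j", X, S, X)   # X_{.j}^T S X_{.j} per group
    return float(np.sum(quad / sizes))

def is_feasible(X, B=None, b=None):
    """Check the constraints of Formulation (6) for a candidate X."""
    X = np.asarray(X).astype(int)
    if not np.isin(X, (0, 1)).all():
        return False
    if (X.sum(axis=0) < 1).any():              # every group holds >= 1 task
        return False
    if (X.sum(axis=1) < 1).any():              # every task is in >= 1 group
        return False
    if B is not None and ((np.asarray(B) * X).sum(axis=0) > np.asarray(b)).any():
        return False                           # per-group budget exceeded
    m = X.shape[1]
    for j1 in range(m):
        for j2 in range(j1 + 1, m):
            if np.sum((X[:, j1] - X[:, j2]) ** 2) < 1:
                return False                   # two identical groups
    return True

# Hypothetical pairwise transfer gains for 4 tasks (illustrative values only).
S = np.array([[0.0, 0.2, -0.1, 0.0],
              [0.3, 0.0, 0.0, -0.2],
              [-0.1, 0.0, 0.0, 0.4],
              [0.0, -0.1, 0.5, 0.0]])
# Tasks 0 and 1 in group 0, tasks 2 and 3 in group 1.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
score = grouping_objective(X, S)
```

Dividing each quadratic form by the group size keeps large groups from dominating the sum, which is why the sketch computes `sizes` separately from `quad`.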

Figure 1: Demonstration of Total Test Error across Grouping Methods. Panels: (a) CelebA; (b) ETTm1. This figure presents the variation in total test error for different grouping methods, segmented by data splits 2, 3, and 4. The dotted and dashed horizontal lines indicate the Single Task Learning (STL) and the best Multi-Task Learning (MTL) benchmarks, respectively. For the CelebA dataset, Naive-MTL is the best MTL benchmark; for ETTm1, it is PCGrad.

4.3 Discussion on the Computational Complexity

In this part, we discuss the computational complexity of the proposed method for collecting transfer gains and compare it to the most related work, TAG (Fifty et al., 2021). We begin by defining the relevant notation. For each task $T_{i}$, the feed-forward and backward computational costs of the neural networks are denoted as $\mathcal{F}_{i}$ and $\mathcal{B}_{i}$, respectively. The average computation costs for these tasks are represented by $\mathcal{F}$ and $\mathcal{B}$, where $\mathcal{B}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{B}_{i}$ and $\mathcal{F}=\frac{1}{n}\sum_{i=1}^{n}\mathcal{F}_{i}$. Based on this notation, the computation costs for TAG and our proposed method are $(n^{2}+n)\mathcal{F}+n\mathcal{B}$ and $(2n^{2}+n)\mathcal{F}+n\mathcal{B}$, respectively. Both methods have the same order of computational complexity with respect to the number of tasks.
While our method involves an additional $n^{2}$ feedforward computations, it is worth emphasizing that feedforward costs are typically substantially smaller than backward costs in practice. Consequently, the impact of these additional feedforward computations is not significant, especially when considering the theoretical advantages our method offers. Moreover, we demonstrate in Appendix D that employing a lazy collection strategy for transfer gains leads to improved performance. Additionally, the analysis reveals that the cumulative transfer gain varies across different training phases, yielding diverse outcomes. These insights allow us to substantially reduce computational costs.
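The two cost expressions can be compared with a few lines of arithmetic. The sketch below is only a worked example of the formulas stated above; the unit costs `F = 1.0` and `B = 2.0` are hypothetical values chosen for illustration, not measurements from the paper.

```python
def tag_cost(n, F, B):
    """Cost of collecting TAG affinities: (n^2 + n) F + n B."""
    return (n ** 2 + n) * F + n * B

def ours_cost(n, F, B):
    """Cost of collecting the proposed transfer gains: (2 n^2 + n) F + n B."""
    return (2 * n ** 2 + n) * F + n * B

# Hypothetical unit costs: one forward pass = 1.0, one backward pass = 2.0.
n, F, B = 9, 1.0, 2.0
extra = ours_cost(n, F, B) - tag_cost(n, F, B)   # equals n^2 * F
```

The difference between the two costs is exactly $n^{2}\mathcal{F}$ and does not depend on $\mathcal{B}$, so both expressions grow quadratically in the number of tasks.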

5 Experiments

Table 1: Comparative Demonstration of Single-Task Training (STL), Multi-Task Learning (MTL), and Task Grouping Approaches on the COP Benchmark. The table highlights the following representations: STL's strong baseline in orange, the optimal baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the optimal baseline) in bold.

| | Method | TSP20 | TSP50 | CVRP20 | CVRP50 | OP20 | OP50 | Tot. Gap |
|---|---|---|---|---|---|---|---|---|
| STL | STL | 0.017% | 0.277% | 0.534% | 1.780% | -0.849% | 1.117% | 2.876% |
| MTL | Naive-MTL | 0.022% | 0.469% | 0.522% | 2.070% | -0.805% | 1.270% | 3.548% |
| | Bandit-MTL | 0.021% | 0.882% | 0.690% | 2.511% | -0.865% | 2.114% | 5.354% |
| | PCGrad | 0.028% | 0.708% | 0.605% | 2.411% | -0.689% | 1.756% | 4.819% |
| | UW | 0.042% | 0.362% | 0.412% | 1.703% | -0.665% | 1.153% | 3.007% |
| | LS | 0.020% | 0.476% | 0.512% | 2.084% | -0.792% | 1.197% | 3.498% |
| | Nash-MTL | 0.038% | 0.322% | 0.421% | 1.847% | -0.873% | 1.279% | 3.034% |
| Task Grouping | Optimal-3 | 0.022% | 0.307% | 0.404% | 1.692% | -0.861% | 0.929% | 2.492% |
| | Random-3 | 0.031% | 0.535% | 0.416% | 1.741% | -0.764% | 1.279% | 3.237% |
| | TAG-3 | 0.021% | 0.324% | 0.404% | 1.692% | -0.771% | 1.104% | **2.774%** |
| | Ours-3 | 0.021% | 0.324% | 0.404% | 1.692% | -0.771% | 1.104% | **2.774%** |
| | Optimal-4 | 0.027% | 0.277% | 0.404% | 1.692% | -0.897% | 0.929% | 2.431% |
| | Random-4 | 0.024% | 0.473% | 0.681% | 2.125% | -0.852% | 1.072% | 3.522% |
| | TAG-4 | 0.021% | 0.324% | 0.504% | 2.028% | -0.780% | 1.319% | 2.757% |
| | Ours-4 | 0.017% | 0.277% | 0.404% | 1.692% | -0.771% | 1.104% | **2.723%** |
| | Optimal-5 | 0.017% | 0.277% | 0.404% | 1.692% | -0.897% | 0.929% | 2.422% |
| | Random-5 | 0.038% | 0.444% | 0.472% | 1.992% | -0.761% | 1.233% | 3.419% |
| | TAG-5 | 0.021% | 0.307% | 0.404% | 1.692% | -0.771% | 1.104% | 2.757% |
| | Ours-5 | 0.021% | 0.324% | 0.404% | 1.692% | -0.840% | 1.095% | **2.696%** |

In this section, we present the experimental evaluation of our novel task grouping approach, demonstrating its effectiveness across diverse domains. Our experiments are designed to validate the versatility and superiority of our method in three distinct scenarios: (1) Computer vision tasks on the CelebA dataset; (2) Combinatorial optimization benchmarks from Wang & Yu (2023); and (3) Time series tasks following Song et al. (2022). The metrics used across all scenarios are calculated based on the best outcome for each task within its respective groups.
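One reading of "the best outcome for each task within its respective groups" is sketched below: each trained group reports a metric per task, the best value per task is kept, and the total is the sum over tasks. This is an assumption about the aggregation rule, the metric is assumed to be lower-is-better (error, gap, MAE), and the task names and numbers are hypothetical.

```python
def total_from_groups(group_results):
    """Aggregate per-group results into a per-task best and a total.

    group_results: list of dicts mapping task name -> metric value,
    one dict per trained group. Lower values are assumed to be better.
    """
    best = {}
    for result in group_results:
        for task, value in result.items():
            # Keep the best (smallest) value seen for this task so far.
            best[task] = min(value, best.get(task, float("inf")))
    return best, sum(best.values())

# Hypothetical results: task "B" is trained in both groups.
groups = [{"A": 0.30, "B": 0.50}, {"B": 0.40, "C": 0.20}]
best_per_task, total = total_from_groups(groups)
```

Under this rule a task that appears in several groups contributes only its best result, so overlapping groups can never increase the total.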

Baselines: Our experimental evaluation involves a comprehensive comparison against a range of established methods. These baselines fall into three types, each representing a different approach to task learning and organization in MTL: (1) Single Task Learning (STL); (2) MTL methods: we consider a variety of MTL methods that employ different strategies for joint task learning, including Naive-MTL, Bandit-MTL (Mao et al., 2021), PCGrad (Yu et al., 2020), Nash-MTL (Navon et al., 2022), Uncertainty-Weighting (UW) (Kendall et al., 2018) and LinearScale (LS); (3) Task Grouping methods: this category includes methods that explicitly group tasks based on certain criteria or strategies. The common task grouping baselines are: Random Policy, where tasks are grouped randomly and results are averaged over 10 repeats; Optimal Policy, obtained by enumeration and providing an upper benchmark for performance; and TAG (Fifty et al., 2021), a state-of-the-art task grouping method that groups tasks based on their affinity. Specific differences and adaptations of these baselines are detailed in the corresponding sections for each scenario.

5.1 Computer Vision Tasks

In this part, we investigate our task grouping approach in the field of computer vision, using the CelebA dataset for our analysis. The effectiveness of our method is quantified by the total classification error rate across all tasks. To ensure experimental consistency, we adhere to the network architecture and training hyperparameters specified in TAG (Fifty et al., 2021), employing the same 9 tasks as in TAG. In addition, we include several further baselines for a more thorough comparative evaluation, namely High Order Approximation (HOA) (Standley et al., 2020), cosine similarity (CS), and the optimal baseline (OPT) as used in TAG.

Figure 1(a) offers a comprehensive evaluation of various baselines in terms of the Total Error metric under different group split conditions. Naive multi-task learning performs better than single-task learning, and most grouping methods reduce the total error further. This indicates the presence of complex underlying relationships between the tasks in this dataset, suggesting that grouping methods can identify specific combinations that enhance overall performance. Our method consistently outperforms all baselines in terms of Total Error across all splits, with the only exception being the 'OPT' strategy in TAG (Fifty et al., 2021). Moreover, our method effectively leverages the granularity afforded by increased splits, demonstrating a decrease in total error with a greater number of splits, as elaborated in Appendix C.

5.2 Combinatorial Optimization Benchmarks

In this section, we focus on combinatorial optimization, specifically testing six tasks: TSP20, TSP50, CVRP20, CVRP50, OP20, and OP50. These tasks encompass various scales of the Traveling Salesman Problem (TSP), Capacitated Vehicle Routing Problem (CVRP), and Orienteering Problem (OP), offering a wide spectrum of combinatorial challenges. The neural solver used for these tasks is the POMO (Kwon et al., 2020) framework, noted for its effectiveness in addressing combinatorial optimization problems. Consistency in experimental conditions is maintained by aligning the training parameters with those outlined in the POMO study. Performance evaluation is centered on the average optimality gap, defined mathematically as:

$$
Gap=\frac{1}{N}\sum_{i=1}^{N}\left(1-\frac{\text{solver}(\mathcal{I}_{i})}{\text{gt}(\mathcal{I}_{i})}\right)\times 100\%
$$

evaluated over $N=10{,}000$ instances for each task, to measure the solution's deviation from the ground truth obtained from Gurobi (Gurobi Optimization, LLC, 2023), with results detailed in Table 1.
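The gap formula above can be computed directly from paired objective values. The sketch below follows the expression exactly as written; the function name and the sample values are hypothetical, and in practice `gt_values` would hold the reference objective values for the same instances.

```python
def average_gap(solver_values, gt_values):
    """Average optimality gap in percent: mean of (1 - solver/gt) * 100."""
    if len(solver_values) != len(gt_values) or not solver_values:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = (1.0 - s / g for s, g in zip(solver_values, gt_values))
    return 100.0 * sum(ratios) / len(solver_values)

# Hypothetical objective values for two instances.
gap = average_gap([9.0, 4.5], [10.0, 5.0])
```

With this expression the sign of the result depends on whether the solver's objective value lies below or above the reference value for each instance.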

In the comparative analysis presented in Table 1, Single-Task Learning (STL) provides a robust baseline, outperforming all Multi-Task Learning (MTL) methods with respect to the Total Gap metric. Among Task Grouping methods, both TAG and our proposed method have shown the capability to surpass the STL baseline in certain aspects. Notably, our method consistently achieves the best performance among non-optimal baselines for each number of splits, indicating its efficacy in handling multiple related tasks simultaneously. Considering the trends across different task groupings, the performance of Optimal, TAG, and our method improves as the number of splits increases. This trend suggests that the tasks within the COP benchmark exhibit high positive transfer potential.

In the context of 3-split task grouping, both TAG and our method achieve logical groupings of tasks, pairing (TSP20, TSP50), (CVRP20, CVRP50), and (OP20, OP50) together. This reflects an intuitive understanding that tasks of the same type benefit from being trained in concert. Intriguingly, the optimal groupings, (TSP20, TSP50, CVRP20), (CVRP20, CVRP50) and (TSP50, CVRP20, OP20, OP50), do not align with these intuitive pairings, suggesting that there may be non-obvious correlations that, when leveraged, could lead to even greater improvements in task performance.

Figure 2: Comparative Performance under Maximal and Minimal Group Size Constraints. Panels: (a) Maximal Group Constraint; (b) Minimal Group Constraint. The figure delineates the performance of our task grouping method against the random policy. Metrics such as Total Error for the CelebA dataset, Total Gap for combinatorial optimization problems (COP), and Total Mean Absolute Error (MAE) for time series forecasting tasks are evaluated across a range of group sizes, illustrating the adaptability of our method to both maximal and minimal size constraints.

5.3 Time Series Tasks

In this section, we focus on time series forecasting using the ETTm1 dataset (Wu et al., 2021), which comprises seven distinct electric load time series. Following Song et al. (2022), each series within the ETTm1 dataset is treated as an individual forecasting task. Performance is assessed using the relative reduction in mean absolute error (MAE) as the evaluation metric. This study employs a model architecture based on the AutoFormer framework (Wu et al., 2021), which is designed for the predictive challenges of multi-variate time series data.

Figure 1(b) shows that all methods exhibit significant variance across experiments, likely attributable to the inherent complexity of time series forecasting. This uncertainty makes it harder to estimate transfer gains precisely. Nevertheless, our method surpasses STL, MTL, and TAG in these demanding conditions, with TAG's performance equivalent to random task grouping. Moreover, our method produces groupings with reduced errors when using a greater number of splits, as elaborated in Appendix C.

5.4 Constraint-Aware Task Grouping

In practical scenarios, constraints like limited computational resources, data availability, and group size requirements are prevalent. Tasks often have different data acquisition costs, necessitating budget management. In distributed learning, with resources spread across nodes, adhering to group size limits is essential. Our mathematical programming approach in Formulation 6 effectively tackles the task grouping challenge under these constraints, incorporating customized limitations to meet the practical demands of real-world applications. In this section, we address the constraint that group sizes must fall within a specified range, as shown in:

$$
\mathbf{m_{1}}\leq X^{T}\mathbf{1}\leq\mathbf{m_{2}}, \tag{7}
$$

where $\mathbf{m_{1}},\mathbf{m_{2}}\in\mathbb{R}^{m}$, applying element-wise comparison. Our experiments across computer vision, combinatorial optimization (COP), and time series scenarios demonstrate the model's capability to adhere to these size constraints. We benchmark our method against random sampling, conducted 10 times under the same constraints; the experimental setup is detailed below.

Maximum Group Size Constraint: For this constraint, we define $\mathbf{m_{1}}$ as $\mathbf{1}$ and $\mathbf{m_{2}}$ as $M\mathbf{1}$, where $M$ represents the maximum permissible group size, dictated by memory limitations. In the context of the CelebA dataset, which comprises 9 tasks, we set $M$ within the range $\{5,6,7,8,9\}$. For the COP benchmark involving 6 tasks, $M$ is selected from $\{3,4,5,6\}$, and for the time series scenario comprising 7 tasks, we choose $M$ from $\{4,5,6,7\}$.

The experimental results, as illustrated in Figure 2(a), exhibit a uniform trend across varying cases and group sizes. For example, in the CelebA dataset, our method consistently exhibits superior performance compared to the random policy. This is evident in the progressively lower Total Error rates as the maximum group size increases. Similarly, in the COP benchmarks and time series tasks with their respective maximum group size constraints, our method maintains or even enhances its performance. This consistent trend, as seen across various maximal group sizes and tasks, underscores the robustness and adaptability of our approach in accommodating changing task constraints.

Minimum Group Size Constraint: In this case, the vectors $\mathbf{m_{1}}$ and $\mathbf{m_{2}}$ are defined as $m\mathbf{1}$ and $M_{\text{max}}\mathbf{1}$, respectively, where $M_{\text{max}}$ denotes the total number of tasks specific to each case: 9 for the CelebA dataset, 6 for COP, and 7 for time series tasks. This formulation encapsulates the requirement that device utilization at each node surpasses a certain threshold, ensuring an efficient usage rate. For all three scenarios, we set $m$ to be in the set $\{1,2,3,4\}$.
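The two choices of $\mathbf{m_{1}}$ and $\mathbf{m_{2}}$ described above can be written down and checked against a candidate assignment as follows. This is a minimal sketch of Constraint (7) only, not of the solver; the helper names and the example matrix are hypothetical.

```python
import numpy as np

def size_bounds(num_groups, mode, value, num_tasks):
    """Build (m1, m2) for the maximum or minimum group size setting."""
    ones = np.ones(num_groups, dtype=int)
    if mode == "max":          # m1 = 1, m2 = M * 1
        return ones, value * ones
    if mode == "min":          # m1 = m * 1, m2 = M_max * 1
        return value * ones, num_tasks * ones
    raise ValueError("mode must be 'max' or 'min'")

def satisfies_size_constraint(X, m1, m2):
    """Element-wise check of m1 <= X^T 1 <= m2 for a binary matrix X."""
    sizes = np.asarray(X).sum(axis=0)      # X^T 1: number of tasks per group
    return bool(np.all(m1 <= sizes) and np.all(sizes <= m2))

# Hypothetical assignment: 6 tasks split into 3 groups of two tasks each.
X = np.zeros((6, 3), dtype=int)
for task, group in enumerate([0, 0, 1, 1, 2, 2]):
    X[task, group] = 1

ok_max = satisfies_size_constraint(X, *size_bounds(3, "max", 3, 6))
ok_min = satisfies_size_constraint(X, *size_bounds(3, "min", 3, 6))
```

Here the pairwise grouping respects a maximum size of 3 but violates a minimum size of 3, since every group holds only two tasks.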

Results can be seen in Figure 2(b). In the CelebA dataset, as the minimal group size increases, our method consistently exhibits lower Total Error compared to the random policy, indicating improved efficiency in handling larger group sizes. Similarly, in the COP benchmarks, the Total Gap percentage decreases as the group size grows, highlighting our method's effectiveness in optimizing under tighter constraints. While the performance of our method in the time series tasks under minimal group size constraints does not match the levels achieved in the CelebA and COP benchmarks, it still notably outperforms the random policy.

Figure 3: Performance comparison of TAG-MP, Ours-MP, and Ours-BB across multiple grouping splits on CelebA, COP, and ETTm1 benchmarks. The gray dashed line represents the intractable limit beyond which Ours-BB could not be computed due to computational constraints.

5.5 Ablation Study

In this section, we conduct an ablation study to underscore the significance of the novel transfer gain and the mathematical programming framework, in comparison to TAG (Fifty et al., 2021). To demonstrate the significance of our transfer gain metric, we apply the mathematical programming framework from Formulation 6 to task grouping using either our transfer gain $\mathcal{S}_{i\rightarrow j}^{t}$, as per formulation 4.1, or TAG's affinity $\mathcal{Z}_{i\rightarrow j}^{t}$; we term the former "Ours-MP" and the latter "TAG-MP". We also assess our framework's performance against TAG's branch and bound technique, introducing "Ours-BB" for the Branch & Bound method guided by the transfer gain $\mathcal{S}_{i\rightarrow j}^{t}$.

Ablation on Transfer Gain
The first aspect of the comparison focuses on the transfer gain, contrasting $\mathcal{S}_{i\rightarrow j}^{t}$ and $\mathcal{Z}_{i\rightarrow j}^{t}$ as implemented in Ours-MP and TAG-MP, respectively. Results in Figure 3 reveal that Ours-MP consistently surpasses TAG-MP across all benchmarks. This is evidenced by its lower Total Error in the CelebA benchmark, lower Total Gap percentage in COP, and reduced Total MAE in ETTm1 with an increasing number of grouping splits. TAG-MP's declining performance, particularly in the CelebA dataset, suggests that the transfer gain proposed in our methodology captures the task relationships more accurately than the one proposed in TAG, under a consistent task grouping solver.

Ablation on Task Grouping Solver
In Figure 3, Ours-MP demonstrates better performance and computational efficiency, providing results across all grouping splits, while Ours-BB is computationally limited and unable to process beyond a certain number of groupings, as highlighted by the gray dashed line for CelebA and ETTm1.

Table 2: Comparative Analysis of Time Efficiency Across Different Task Splits and Datasets, for CelebA with 9 tasks, COP with 6 tasks and ETTm1 with 7 tasks. "s", "m" and "h" stand for seconds, minutes and hours, respectively. "-" indicates that the method fails to solve the problem within an 8-hour time limit.

| Splits | CelebA BB | CelebA MP | COP BB | COP MP | ETTm1 BB | ETTm1 MP |
|---|---|---|---|---|---|---|
| 2 | 0.312s | 0.907s | 0.007s | 0.085s | 0.024s | 0.122s |
| 3 | 1.93m | 2.255s | 0.279s | 0.713s | 2.005s | 1.078s |
| 4 | - | 4.009s | 5.955s | 0.899s | 1.63m | 1.677s |
| 5 | - | 7.261s | 1.48m | 1.621s | 1.02h | 2.218s |
| 6 | - | 21.44s | 16.96m | 5.013s | - | 4.158s |

We report the detailed time cost of Ours-MP and Ours-BB in Table 2, abbreviated as MP and BB, respectively. For CelebA, our method covers all scenarios up to 6 splits within 30 seconds. On the COP dataset, the time cost of the BB method increases substantially, potentially exponentially, with larger splits, and on CelebA and ETTm1 BB fails to deliver results within the 8-hour limit for the largest splits. The same trend is observable across datasets: for the same number of splits (2 and 3), the time cost of BB grows drastically as the number of tasks increases. This highlights our method's scalability and robustness in managing increased computational demands across varying scenarios.

6 Conclusions

In conclusion, this work introduces an innovative approach to task grouping in Multi-Task Learning (MTL), marking a substantial advancement over existing methodologies. Our method makes two main contributions: a robust construction of transfer gains that does not depend on limiting assumptions, and a flexible mathematical programming formulation suited to solving task grouping problems. Empirical results demonstrate the superiority of our approach across various fields, surpassing a wide range of baselines not only in performance but also in flexibility and efficiency under real-world conditions. These results highlight the capacity of our method to markedly improve the applicability and performance of MTL models across diverse and resource-restricted environments.

7 Broader Impact

This research on Multi-Task Learning (MTL) presents a novel approach to task grouping that achieves significant efficiency gains in both academic and industrial settings. It stands out for its flexibility in adapting to diverse and realistic demands, which is crucial for managing complex tasks efficiently. This adaptability is particularly important in the context of growing computational demands in large-scale data analysis. However, the approach also brings forth ethical considerations. The interpretation of inter-task affinities, if not handled cautiously, could lead to incorrect associations or biases, especially in sensitive contexts. It is imperative to recognize and address these risks to prevent potential misuse and ensure the responsible application of this technology. Despite these considerations, the method's ability to considerably reduce computational demands, while catering to specific requirements and maintaining high accuracy, is a noteworthy advancement in MTL.

References

  • Byun et al. (2022) Byun, H.-E., Kim, B., and Lee, J. H. Multi-step lookahead bayesian optimization with active learning using reinforcement learning and its application to data-driven batch-to-batch optimization. Computers & Chemical Engineering, 167:107987, 2022.
  • Caruana (1997) Caruana, R. Multitask learning. Machine learning, 28:41–75, 1997.
  • Ding et al. (2023) Ding, C., Lu, Z., Wang, S., Cheng, R., and Boddeti, V. N. Mitigating task interference in multi-task learning via explicit task routing with non-learnable primitives. CVPR, pp.  7756–7765, 2023.
  • Dwivedi & Roig (2019) Dwivedi, K. and Roig, G. Representation similarity analysis for efficient task taxonomy & transfer learning. In CVPR, pp.  12387–12396, 2019.
  • Fifty et al. (2021) Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., and Finn, C. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp.  1126–1135, 2017.
  • Guangyuan et al. (2022) Guangyuan, S., Li, Q., Zhang, W., Chen, J., and Wu, X.-M. Recon: Reducing conflicting gradients from the root for multi-task learning. In ICLR, 2022.
  • Guo et al. (2020) Guo, P., Lee, C.-Y., and Ulbricht, D. Learning to branch for multi-task learning. In ICML, pp.  3854–3863, 2020.
  • Gurobi Optimization, LLC (2023) Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
  • Huang et al. (2022) Huang, Z., Rao, M., Raju, A., Zhang, Z., Bui, B., and Lee, C. Mtl-slt: multi-task learning for spoken language tasks. In Proceedings of the 4th Workshop on NLP for Conversational AI, pp.  120–130, 2022.
  • Javaloy & Valera (2022) Javaloy, A. and Valera, I. Rotograd: Gradient homogenization in multitask learning. In ICLR, 2022.
  • Kang et al. (2011) Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. In ICML, pp.  521–528, 2011.
  • Kendall et al. (2018) Kendall, A., Gal, Y., and Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  7482–7491, 2018.
  • Kumar & Daume III (2012) Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.
  • Kwon et al. (2020) Kwon, Y.-D., Choo, J., Kim, B., Yoon, I., Gwon, Y., and Min, S. Pomo: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems, 33:21188–21198, 2020.
  • Lee et al. (2016) Lee, G., Yang, E., and Hwang, S. Asymmetric multi-task learning based on task relatedness and loss. In International conference on machine learning, pp.  230–238. PMLR, 2016.
  • Lee et al. (2018) Lee, H. B., Yang, E., and Hwang, S. J. Deep asymmetric multi-task feature learning. In ICML, pp.  2956–2964, 2018.
  • Liu et al. (2021a) Liu, B., Liu, X., Jin, X., Stone, P., and Liu, Q. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 34:18878–18890, 2021a.
  • Liu et al. (2021b) Liu, L., Li, Y., Kuang, Z., Xue, J., Chen, Y., Yang, W., Liao, Q., and Zhang, W. Towards impartial multi-task learning. In ICLR, 2021b.
  • Liu et al. (2022) Liu, S., James, S., Davison, A. J., and Johns, E. Auto-lambda: Disentangling dynamic task relationships. Trans. Mach. Learn. Res., 2022, 2022.
  • Malhotra et al. (2022) Malhotra, A., Vatsa, M., and Singh, R. Dropped scheduled task: Mitigating negative transfer in multi-task learning using dynamic task dropping. Transactions on Machine Learning Research, 2022.
  • Mao et al. (2021) Mao, Y., Wang, Z., Liu, W., Lin, X., and Hu, W. Banditmtl: Bandit-based multi-task learning for text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  5506–5516, 2021.
  • Navon et al. (2022) Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., and Fetaya, E. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022.
  • Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Raychaudhuri et al. (2022) Raychaudhuri, D. S., Suh, Y., Schulter, S., Yu, X., Faraki, M., Roy-Chowdhury, A. K., and Chandraker, M. Controllable dynamic multi-task architectures. In CVPR, pp.  10945–10954, 2022.
  • Sherif et al. (2023) Sherif, A., Abid, A., Elattar, M., and ElHelw, M. Stg-mtl: Scalable task grouping for multi-task learning using data map. arXiv preprint arXiv:2307.03374, 2023.
  • Song et al. (2022) Song, X., Zheng, S., Cao, W., Yu, J., and Bian, J. Efficient and effective multi-task grouping via meta learning on task combinations. Advances in Neural Information Processing Systems, 35:37647–37659, 2022.
  • Standley et al. (2020) Standley, T., Zamir, A., Chen, D., Guibas, L., Malik, J., and Savarese, S. Which tasks should be learned together in multi-task learning? In ICML, pp.  9120–9132, 2020.
  • Swayamdipta et al. (2020) Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In EMNLP, pp.  9275–9293. Association for Computational Linguistics, 2020.
  • Vandenhende et al. (2021) Vandenhende, S., Georgoulis, S., Van Gansbeke, W., Proesmans, M., Dai, D., and Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021.
  • Wang & Yu (2023) Wang, C. and Yu, T. Efficient training of multi-task neural solver with multi-armed bandits. arXiv preprint arXiv:2305.06361, 2023.
  • Wang et al. (2024) Wang, C., Yu, Z., McAleer, S., Yu, T., and Yang, Y. Asp: Learn a universal neural solver! IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Wang et al. (2020a) Wang, J., Tantia, V., Ballas, N., and Rabbat, M. Lookahead converges to stationary points of smooth non-convex functions. In ICASSP, pp.  8604–8608, 2020a.
  • Wang et al. (2020b) Wang, Z., Lipton, Z. C., and Tsvetkov, Y. On negative interference in multilingual models: Findings and a meta-learning treatment. arXiv preprint arXiv:2010.03017, 2020b.
  • Wu et al. (2021) Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
  • Yu et al. (2020) Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.
  • Yue et al. (2023) Yue, Z., Zhang, Y., and Liang, J. Learning conflict-noticed architecture for multi-task learning. In AAAI Conference on Artificial Intelligence, 2023.
  • Zamir et al. (2018) Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In CVPR, pp.  3712–3722, 2018.
  • Zhang et al. (2022a) Zhang, L., Liu, X., and Guan, H. A tree-structured multi-task model recommender. In AutoML, 2022a.
  • Zhang et al. (2019a) Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. Lookahead optimizer: k steps forward, 1 step back. Advances in neural information processing systems, 32, 2019a.
  • Zhang & Yang (2021) Zhang, Y. and Yang, Q. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.
  • Zhang et al. (2019b) Zhang, Z., Wu, B., and Schuller, B. Attention-augmented end-to-end multi-task learning for emotion prediction from speech. In ICASSP, pp.  6705–6709, 2019b.
  • Zhang et al. (2022b) Zhang, Z., Yu, W., Yu, M., Guo, Z., and Jiang, M. A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods. arXiv preprint arXiv:2204.03508, 2022b.
  • Zhou et al. (2021) Zhou, P., Yan, H., Yuan, X., Feng, J., and Yan, S. Towards understanding why lookahead generalizes better than sgd and beyond. Advances in Neural Information Processing Systems, 34:27290–27304, 2021.

Appendix A Proof of Proposition 4.3

Proof.

Following the Taylor expansion:

\begin{align*}
\mathcal{S}_{A\rightarrow j}^{t}
&=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{A\cup\{j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}
=\frac{\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\sum_{i\in A}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}\\
&=\frac{\sum_{i\in A}\nabla_{\phi}L_{j}(\xi,\theta_{j}^{t+1})^{T}\left(\phi_{\{i,j\}}^{t+1}-\phi_{\{j\}}^{t+1}\right)}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}\\
&=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}
+\frac{\sum_{i\in A}\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)^{T}\eta_{t}\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}\\
&=\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}
+\eta_{t}\frac{\sum_{i\in A}\left\|\nabla_{\phi}\left(L_{j}(\xi,\theta_{j}^{t+1})-\frac{1}{|A|}L_{j}(\xi_{j},\theta_{j}^{t+1})\right)\right\|\cdot\left\|\nabla_{\phi}L_{i}(\phi^{t},\theta_{i}^{t})\right\|}{L_{j}\left(\phi_{\{j\}}^{t+1},\theta_{j}^{t+1}\right)}\\
&\leq\frac{1}{|A|}\sum_{i\in A}\mathcal{S}_{i\rightarrow j}^{t}+\frac{\eta_{t}(1+|A|)L^{2}}{C}
\end{align*}
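
The second equality in the derivation above replaces the group displacement $\phi_{A\cup\{j\}}^{t+1}-\phi_{\{j\}}^{t+1}$ by the sum of pairwise displacements $\phi_{\{i,j\}}^{t+1}-\phi_{\{j\}}^{t+1}$. The following toy check illustrates that step. It is a minimal sketch that assumes a plain SGD lookahead update, $\phi_{S}^{t+1}=\phi^{t}-\eta_{t}\sum_{i\in S}\nabla_{\phi}L_{i}$; the gradients, the function names and the step size are made up for illustration and are not taken from the paper. The identity does not depend on the sign convention of the update.

```python
# Toy check: under an ASSUMED plain-SGD lookahead update, the displacement caused by
# adding a whole group A equals the sum of the displacements caused by each member.
# All numbers and names are hypothetical.

def lookahead(phi, grads, tasks, eta):
    """Shared parameters after one SGD step on the summed gradients of `tasks`."""
    step = [sum(grads[i][d] for i in tasks) for d in range(len(phi))]
    return [p - eta * s for p, s in zip(phi, step)]

def diff(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

phi = [0.5, -1.0, 2.0]            # shared parameters at step t
grads = {                         # per-task gradients w.r.t. the shared parameters
    "j": [0.2, 0.1, -0.3],
    "a": [1.0, 0.0, 0.5],
    "b": [-0.4, 0.8, 0.1],
}
eta = 0.1
group_a = ["a", "b"]              # the set A of tasks trained together with j

phi_j = lookahead(phi, grads, ["j"], eta)
group_shift = diff(lookahead(phi, grads, group_a + ["j"], eta), phi_j)

pair_shift_sum = [0.0] * len(phi)
for i in group_a:
    pair = diff(lookahead(phi, grads, [i, "j"], eta), phi_j)
    pair_shift_sum = [s + p for s, p in zip(pair_shift_sum, pair)]

# Largest coordinate-wise gap between the two displacements (zero up to rounding).
max_gap = max(abs(g - p) for g, p in zip(group_shift, pair_shift_sum))
print(max_gap)
```

Because the update is linear in the gradients, the two displacements agree up to floating-point rounding, which is what allows the group affinity to be related to the pairwise affinities.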

Appendix B Formulation Transformation

We here show the transformation of Formulation 6 into a Mixed-Integer Quadratic Program with non-linear constraints. We introduce a continuous variable $\mathbf{y}\in[0,1]^{n}$ and binary variables $Z_{ijk}$ to obtain:

\begin{align}
\max_{X,y}\quad & \sum_{j=1}^{m}\sum_{k=1}^{n}\sum_{i=1}^{n}S_{ik}\,y_{j}\,Z_{ijk} \tag{8}\\
\text{s.t.}\quad & X^{T}\mathbf{1}\geq\mathbf{1} \notag\\
& X\mathbf{1}\geq\mathbf{1} \notag\\
& (B\odot X)^{T}\mathbf{1}\leq b \notag\\
& X^{T}\mathbf{y}=\mathbf{1} \notag\\
& Z_{ijk}=X_{ij}\cdot X_{kj}\quad \forall i,j,k \notag\\
& \|X_{\cdot j_{1}}-X_{\cdot j_{2}}\|^{2}\geq 1,\quad j_{1}\neq j_{2} \notag\\
& X\in\{0,1\}^{n\times n} \notag
\end{align}

which can be solved by classical solvers (we use Gurobi (Gurobi Optimization, LLC, 2023) in this work).
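
For very small instances, the structure of the program can be inspected by exhaustive search instead of a solver. The sketch below is not the Gurobi model used in this work. It assumes one particular reading of the objective, namely that $y_{j}$ acts as the normaliser $1/|G_{j}|$ of group $j$, so that each group contributes its pairwise gains divided by its size. It also drops the budget constraint $(B\odot X)^{T}\mathbf{1}\leq b$. The function names and the toy gain matrix `S` are hypothetical.

```python
from itertools import product

def score(groups, S):
    """ASSUMED objective: sum over groups of (1/|G|) * sum_{i,k in G} S[i][k]."""
    total = 0.0
    for g in groups:
        total += sum(S[i][k] for i in g for k in g) / len(g)
    return total

def best_grouping(S, m):
    """Brute-force search over binary assignment matrices X (n tasks x m groups)."""
    n = len(S)
    best_score, best_groups = float("-inf"), None
    for bits in product([0, 1], repeat=n * m):
        # Column j of X lists which tasks belong to group j.
        cols = [bits[j * n:(j + 1) * n] for j in range(m)]
        groups = [tuple(i for i in range(n) if col[i]) for col in cols]
        if any(len(g) == 0 for g in groups):
            continue  # every group must be non-empty (X^T 1 >= 1)
        if any(not any(col[i] for col in cols) for i in range(n)):
            continue  # every task must be covered (X 1 >= 1)
        if len(set(groups)) < m:
            continue  # groups must be pairwise distinct
        s = score(groups, S)
        if s > best_score:
            best_score, best_groups = s, groups
    return best_score, best_groups

# Toy transfer gains: tasks 0 and 1 help each other, as do tasks 2 and 3;
# all cross pairs hurt each other.
S = [
    [0, 1, -1, -1],
    [1, 0, -1, -1],
    [-1, -1, 0, 1],
    [-1, -1, 1, 0],
]
best_score, best_groups = best_grouping(S, m=2)
print(best_score, sorted(best_groups))
```

The search space grows as $2^{nm}$, so this is only useful for checking a model on a handful of tasks; the mixed-integer formulation above is what makes larger instances tractable.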

Appendix C Further Results on CelebA and ETTm1

This section provides a comprehensive overview of the experimental settings and results, including the model architectures employed, hyperparameters, benchmark methods, and modifications implemented to ensure a more equitable comparison. The experiments were conducted on a server equipped with 8 NVIDIA GeForce RTX 3090 GPUs and an AMD EPYC 7542 32-Core CPU. The primary software versions used are CUDA 11.8, TensorFlow 2.14.1, and PyTorch 2.1.2.

C.1 Computer Vision Tasks

In the computer vision tasks, the network is a ResNet16 encoder coupled with one MLP decoder per task. Model-structure hyperparameters and training attributes such as hidden dimensions, encoder layers, initial learning rate and scheduling method mirror those in Fifty et al. (2021). We compare our approach against four benchmark MTL methods, namely Naive-MTL, GradNorm, PCGrad, and Uncertainty Weights (UW). Notably, we directly adopt the exact groupings reported by Fifty et al. (2021) as the grouping outcomes of the TAG, HOA, and CS methods, though the total error is based on our own test results. As Table 3 shows, our method typically surpasses the other grouping and MTL methods, except for the OPT groupings derived from TAG. Even these may not be the optimal groupings under our test results, as our approach outperforms OPT in the 2-split scenario.
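
As a reading aid for Table 3, the Total Error column appears to aggregate the nine per-task errors by summation. The short check below copies the mean per-task errors of the STL and Ours-2 rows from the table and sums them. Treating Total Error as a plain sum is an inference from the reported numbers, not a statement from the text, and the variable names are illustrative.

```python
# Mean per-task errors (a1..a9) copied from the STL and Ours-2 rows of Table 3.
stl = [6.47, 11.27, 4.19, 12.29, 2.72, 3.12, 4.98, 4.85, 0.73]
ours_2 = [6.55, 11.09, 4.19, 12.05, 2.58, 2.88, 4.80, 4.71, 0.70]

# ASSUMPTION: Total Error is the sum of the per-task errors.
stl_total = round(sum(stl), 2)
ours_2_total = round(sum(ours_2), 2)

# Reported totals in Table 3, for comparison.
reported = {"STL": 50.62, "Ours-2": 49.53}
print(stl_total, reported["STL"])
print(ours_2_total, reported["Ours-2"])
```

The STL row reproduces its reported total, while the Ours-2 sum differs from the reported value in the second decimal, which is what one would expect if the totals are averaged over runs before rounding.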

The CelebA dataset, comprising over 200K celebrity images with 40 attribute annotations each, serves as a large-scale dataset for facial attributes. We select 9 attributes for our tasks: 5 o’Clock Shadow, Black Hair, Blond Hair, Brown Hair, Goatee, Mustache, No Beard, Rosy Cheeks, and Wearing Hat, aligning with the attributes employed in TAG.

Table 3: Comparison among our proposed method, single task training (STL) and multi-task learning (MTL) methods on the CelebA dataset. The table highlights the following representations: MTL’s strong baseline in orange, the OPT baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the OPT baseline) in bold.
| Group | Method | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | Total Error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | STL | 6.47 ± 0.044 | 11.27 ± 0.037 | 4.19 ± 0.006 | 12.29 ± 0.020 | 2.72 ± 0.038 | 3.12 ± 0.015 | 4.98 ± 0.031 | 4.85 ± 0.019 | 0.73 ± 0.007 | 50.62 ± 0.031 |
| MTL | Naive-MTL | 6.55 ± 0.016 | 11.09 ± 0.023 | 4.19 ± 0.014 | 12.56 ± 0.101 | 2.58 ± 0.015 | 3.02 ± 0.027 | 4.80 ± 0.017 | 4.74 ± 0.034 | 0.70 ± 0.004 | 50.23 ± 0.150 |
| | GradNorm | 7.18 ± 0.071 | 11.35 ± 0.028 | 4.21 ± 0.034 | 12.18 ± 0.127 | 2.52 ± 0.015 | 2.85 ± 0.032 | 5.01 ± 0.084 | 5.29 ± 0.103 | 0.72 ± 0.019 | 51.31 ± 0.367 |
| | PCGrad | 6.58 ± 0.101 | 11.12 ± 0.019 | 4.27 ± 0.048 | 12.67 ± 0.722 | 2.61 ± 0.034 | 2.92 ± 0.010 | 4.96 ± 0.204 | 5.02 ± 0.021 | 0.69 ± 0.030 | 50.83 ± 0.583 |
| | UW | 6.72 ± 0.037 | 11.32 ± 0.019 | 4.30 ± 0.150 | 13.61 ± 0.541 | 2.74 ± 0.049 | 2.93 ± 0.051 | 5.43 ± 0.175 | 4.86 ± 0.040 | 0.79 ± 0.025 | 52.71 ± 0.528 |
| Task Grouping | OPT-2 | 6.60 ± 0.017 | 11.16 ± 0.002 | 4.04 ± 0.028 | 12.05 ± 0.133 | 2.60 ± 0.010 | 2.88 ± 0.011 | 4.82 ± 0.037 | 4.71 ± 0.037 | 0.74 ± 0.025 | 49.58 ± 0.154 |
| | Random-2 | 6.41 ± 0.031 | 11.10 ± 0.019 | 4.14 ± 0.009 | 12.51 ± 0.033 | 2.66 ± 0.007 | 2.92 ± 0.017 | 4.71 ± 0.027 | 4.75 ± 0.025 | 0.78 ± 0.011 | 49.99 ± 0.033 |
| | TAG-2 | 6.50 ± 0.158 | 11.16 ± 0.053 | 4.24 ± 0.038 | 13.15 ± 0.112 | 2.66 ± 0.004 | 2.99 ± 0.006 | 4.88 ± 0.118 | 4.74 ± 0.010 | 0.71 ± 0.005 | 51.04 ± 0.305 |
| | HOA-2 | 6.55 ± 0.018 | 11.27 ± 0.026 | 4.14 ± 0.009 | 11.87 ± 0.045 | 2.57 ± 0.019 | 3.03 ± 0.016 | 4.69 ± 0.024 | 4.81 ± 0.025 | 0.67 ± 0.006 | 49.60 ± 0.046 |
| | CS-2 | 6.58 ± 0.028 | 11.15 ± 0.027 | 4.26 ± 0.097 | 12.35 ± 0.229 | 2.54 ± 0.029 | 3.06 ± 0.017 | 4.86 ± 0.059 | 4.77 ± 0.083 | 0.71 ± 0.017 | 50.28 ± 0.254 |
| | Ours-2 | 6.55 ± 0.016 | 11.09 ± 0.023 | 4.19 ± 0.014 | 12.05 ± 0.133 | 2.58 ± 0.015 | 2.88 ± 0.011 | 4.80 ± 0.017 | 4.71 ± 0.037 | 0.70 ± 0.004 | 49.53 ± 0.128 |
| | OPT-3 | 6.27 ± 0.027 | 11.16 ± 0.002 | 4.04 ± 0.028 | 12.24 ± 0.367 | 2.59 ± 0.006 | 2.92 ± 0.011 | 4.67 ± 0.013 | 4.72 ± 0.019 | 0.71 ± 0.009 | 49.32 ± 0.407 |
| | Random-3 | 6.65 ± 0.030 | 11.16 ± 0.018 | 4.12 ± 0.019 | 12.16 ± 0.147 | 2.61 ± 0.065 | 2.90 ± 0.019 | 4.78 ± 0.018 | 4.71 ± 0.010 | 0.79 ± 0.012 | 49.88 ± 0.128 |
| | TAG-3 | 6.50 ± 0.158 | 11.14 ± 0.047 | 4.11 ± 0.007 | 12.31 ± 0.030 | 2.55 ± 0.029 | 2.92 ± 0.063 | 4.88 ± 0.118 | 4.73 ± 0.013 | 0.76 ± 0.014 | 49.90 ± 0.249 |
| | HOA-3 | 6.55 ± 0.024 | 11.14 ± 0.047 | 4.12 ± 0.023 | 12.31 ± 0.030 | 2.57 ± 0.019 | 2.84 ± 0.002 | 4.69 ± 0.024 | 4.76 ± 0.011 | 0.73 ± 0.012 | 49.72 ± 0.143 |
| | CS-3 | 6.50 ± 0.158 | 11.18 ± 0.008 | 4.10 ± 0.013 | 13.06 ± 0.105 | 2.59 ± 0.026 | 3.06 ± 0.017 | 4.88 ± 0.118 | 4.83 ± 0.013 | 0.74 ± 0.004 | 50.95 ± 0.360 |
| | Ours-3 | 6.55 ± 0.016 | 11.09 ± 0.023 | 4.19 ± 0.014 | 12.05 ± 0.133 | 2.58 ± 0.015 | 2.88 ± 0.011 | 4.80 ± 0.017 | 4.69 ± 0.027 | 0.70 ± 0.004 | 49.52 ± 0.104 |
| | OPT-4 | 6.30 ± 0.036 | 11.11 ± 0.020 | 4.08 ± 0.079 | 11.92 ± 0.012 | 2.68 ± 0.019 | 2.94 ± 0.021 | 4.67 ± 0.013 | 4.77 ± 0.006 | 0.71 ± 0.009 | 49.17 ± 0.009 |
| | Random-4 | 6.40 ± 0.023 | 11.25 ± 0.029 | 4.11 ± 0.010 | 12.92 ± 0.102 | 2.55 ± 0.014 | 3.01 ± 0.002 | 4.74 ± 0.023 | 4.75 ± 0.006 | 0.76 ± 0.013 | 50.49 ± 0.117 |
| | TAG-4 | 6.50 ± 0.158 | 10.97 ± 0.023 | 4.09 ± 0.033 | 12.31 ± 0.030 | 2.62 ± 0.010 | 2.99 ± 0.004 | 4.76 ± 0.009 | 4.63 ± 0.012 | 0.72 ± 0.002 | 49.60 ± 0.128 |
| | HOA-4 | 6.65 ± 0.033 | 11.14 ± 0.047 | 4.11 ± 0.007 | 12.31 ± 0.030 | 2.55 ± 0.029 | 2.92 ± 0.063 | 4.69 ± 0.024 | 4.73 ± 0.013 | 0.76 ± 0.014 | 49.85 ± 0.080 |
| | CS-4 | 6.50 ± 0.158 | 10.97 ± 0.023 | 4.09 ± 0.033 | 12.31 ± 0.030 | 2.59 ± 0.026 | 3.06 ± 0.017 | 4.88 ± 0.118 | 4.63 ± 0.012 | 0.72 ± 0.002 | 49.75 ± 0.233 |
| | Ours-4 | 6.55 ± 0.016 | 11.09 ± 0.023 | 4.09 ± 0.015 | 12.05 ± 0.133 | 2.58 ± 0.015 | 2.88 ± 0.011 | 4.72 ± 0.053 | 4.69 ± 0.027 | 0.70 ± 0.004 | 49.33 ± 0.127 |
| | Random-5 | 6.67 ± 0.016 | 11.20 ± 0.098 | 4.10 ± 0.045 | 11.83 ± 0.193 | 2.61 ± 0.004 | 2.87 ± 0.004 | 5.02 ± 0.043 | 4.81 ± 0.078 | 0.80 ± 0.014 | 49.91 ± 0.362 |
| | Ours-5 | 6.55 ± 0.016 | 11.09 ± 0.023 | 4.09 ± 0.015 | 12.05 ± 0.133 | 2.54 ± 0.029 | 2.88 ± 0.011 | 4.72 ± 0.053 | 4.69 ± 0.027 | 0.70 ± 0.004 | 49.30 ± 0.092 |
| | Random-6 | 6.49 ± 0.022 | 11.23 ± 0.029 | 4.10 ± 0.018 | 11.79 ± 0.023 | 2.57 ± 0.017 | 2.95 ± 0.008 | 4.83 ± 0.019 | 4.71 ± 0.030 | 0.77 ± 0.009 | 49.45 ± 0.028 |
Ours-6 6.55±0.016plus-or-minus6.550.0166.55\pm 0.0166.55 ± 0.016 11.09±0.023plus-or-minus11.090.02311.09\pm 0.02311.09 ± 0.023 4.09±0.015plus-or-minus4.090.0154.09\pm 0.0154.09 ± 0.015 12.05±0.133plus-or-minus12.050.13312.05\pm 0.13312.05 ± 0.133 2.54±0.029plus-or-minus2.540.0292.54\pm 0.0292.54 ± 0.029 2.88±0.011plus-or-minus2.880.0112.88\pm 0.0112.88 ± 0.011 4.72±0.053plus-or-minus4.720.0534.72\pm 0.0534.72 ± 0.053 4.69±0.027plus-or-minus4.690.0274.69\pm 0.0274.69 ± 0.027 0.70±0.004plus-or-minus0.700.0040.70\pm 0.0040.70 ± 0.004 49.30±0.092plus-or-minus49.300.09249.30\pm 0.09249.30 ± 0.092

C.2 Time Series Tasks

For the time series experiments, we use the Autoformer architecture as the neural network. Time series forecasting encompasses two widely recognized settings: multivariate and univariate. Because the ETTm1 dataset comprises seven time series, it can be used in either setting. To enable a task grouping experiment, we configure it as seven univariate prediction tasks and adapt the Autoformer model so that most of its components are shared across tasks, while each task is assigned its own decoder for making predictions. For the detailed hyperparameter settings of the model structure and training, we follow the configurations specified by Song et al. (2022).
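The sharing scheme described above (a common trunk with one decoder per task) can be illustrated with a deliberately tiny stand-in. The sketch below is not Autoformer: the class `SharedTrunkModel`, its linear layers, and the dimensions are hypothetical and chosen only to show which parameters are shared and which are task-specific.

```python
import random

class SharedTrunkModel:
    """Toy hard-parameter-sharing model: one shared encoder, one linear decoder per task."""

    def __init__(self, task_names, input_len, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Shared parameters: every task in the group reads (and would update) these.
        self.shared = [[rng.uniform(-0.1, 0.1) for _ in range(input_len)]
                       for _ in range(hidden_dim)]
        # Task-specific parameters: a separate decoder for each univariate series.
        self.decoders = {name: [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                         for name in task_names}

    def encode(self, window):
        # Shared representation (ReLU of a linear map), reused by all decoders.
        return [max(0.0, sum(w * x for w, x in zip(row, window)))
                for row in self.shared]

    def predict(self, task, window):
        # Only the final linear read-out depends on the task.
        hidden = self.encode(window)
        return sum(w * h for w, h in zip(self.decoders[task], hidden))

tasks = [f"a{i}" for i in range(1, 8)]  # seven series, named as in Table 4
model = SharedTrunkModel(tasks, input_len=4, hidden_dim=8)
window = [0.5, 0.1, -0.2, 0.3]          # illustrative input window
preds = {t: model.predict(t, window) for t in tasks}
```

In a grouping experiment, one such model would be trained per group, with `task_names` restricted to the tasks assigned to that group.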

Table 4 shows that our approach surpasses the other task grouping strategies (excluding the Optimal baseline) at every number of splits. It falls marginally short of the best MTL method, PCGrad, in the 2-split and 3-split scenarios (total MAE of 3.01 and 3.00 versus 2.99). From the 4-split scenario onward, its total Mean Absolute Error (MAE) of 2.99 draws level with PCGrad and remains below that of every other MTL technique. Results for TAG and Optimal beyond the 4-split scenario were not obtained because their computation becomes intractable.
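As a reading aid for Table 4, the short check below sums the per-task MAE means of a few rows and compares them with the reported Total MAE column. It assumes, based only on the numbers themselves, that the total is the sum of the seven per-task values up to rounding; the relative gain over STL is likewise an illustrative quantity, not one reported in the table.

```python
# Per-task MAE means (a1..a7) and reported totals, transcribed from Table 4.
rows = {
    "STL":    ([0.64, 0.37, 0.68, 0.36, 0.56, 0.29, 0.15], 3.05),
    "PCGrad": ([0.65, 0.38, 0.62, 0.35, 0.57, 0.27, 0.15], 2.99),
    "Ours-2": ([0.66, 0.38, 0.64, 0.35, 0.57, 0.27, 0.15], 3.01),
    "Ours-3": ([0.66, 0.38, 0.64, 0.35, 0.56, 0.27, 0.15], 3.00),
    "Ours-4": ([0.63, 0.38, 0.65, 0.36, 0.56, 0.27, 0.14], 2.99),
}

# Compare the summed per-task means with the reported total for each row.
for name, (per_task, reported) in rows.items():
    diff = abs(sum(per_task) - reported)
    print(f"{name:7s} sum={sum(per_task):.2f} reported={reported:.2f} diff={diff:.2f}")

# Relative reduction in total MAE with respect to single-task training.
stl_total = rows["STL"][1]
rel_gain = {name: (stl_total - total) / stl_total
            for name, (_, total) in rows.items()}
```

The small differences between summed and reported totals are what one would expect if the totals were computed from unrounded per-task values.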

Table 4: Comparison among our proposed method, single task training (STL) and multi-task learning (MTL) methods on the ETTm1 dataset. The table highlights the following representations: STL’s strong baseline in orange, the optimal baseline in red for each split, results surpassing STL in blue, and the best outcomes (excluding the optimal baseline) in bold.
Method a1 a2 a3 a4 a5 a6 a7 Total MAE
STL 0.64±0.016 0.37±0.009 0.68±0.015 0.36±0.011 0.56±0.027 0.29±0.002 0.15±0.006 3.05±0.030
MTL
Bandit-MTL 0.67±0.041 0.37±0.010 0.68±0.051 0.34±0.003 0.58±0.017 0.26±0.004 0.14±0.003 3.05±0.104
LS 0.67±0.014 0.38±0.004 0.67±0.040 0.36±0.007 0.58±0.023 0.28±0.003 0.15±0.002 3.09±0.034
UW 0.67±0.011 0.38±0.014 0.62±0.011 0.36±0.013 0.59±0.021 0.27±0.011 0.16±0.013 3.06±0.079
Nash-MTL 0.61±0.017 0.38±0.011 0.65±0.006 0.36±0.004 0.59±0.037 0.27±0.003 0.15±0.013 3.00±0.059
PCGrad 0.65±0.033 0.38±0.011 0.62±0.020 0.35±0.001 0.57±0.006 0.27±0.011 0.15±0.001 2.99±0.066
Naive-MTL 0.66±0.014 0.38±0.006 0.64±0.056 0.35±0.009 0.59±0.016 0.28±0.005 0.15±0.009 3.03±0.101
Task Grouping
Optimal-2 0.61±0.015 0.37±0.007 0.62±0.036 0.35±0.003 0.56±0.029 0.28±0.008 0.15±0.011 2.93±0.064
TAG-2 0.66±0.014 0.37±0.005 0.64±0.056 0.35±0.009 0.59±0.016 0.28±0.005 0.15±0.009 3.03±0.100
Random-2 0.66±0.014 0.38±0.006 0.64±0.056 0.35±0.009 0.59±0.016 0.28±0.005 0.15±0.009 3.03±0.101
Ours-2 0.66±0.014 0.38±0.006 0.64±0.056 0.35±0.009 0.57±0.017 0.27±0.002 0.15±0.009 3.01±0.096
Optimal-3 0.61±0.015 0.37±0.007 0.62±0.036 0.35±0.003 0.55±0.022 0.27±0.015 0.15±0.011 2.91±0.057
TAG-3 0.63±0.030 0.38±0.008 0.65±0.017 0.36±0.007 0.57±0.022 0.29±0.016 0.16±0.001 3.04±0.063
Random-3 0.67±0.025 0.39±0.004 0.65±0.031 0.35±0.004 0.57±0.020 0.28±0.009 0.14±0.001 3.05±0.060
Ours-3 0.66±0.014 0.38±0.006 0.64±0.056 0.35±0.009 0.56±0.011 0.27±0.002 0.15±0.009 3.00±0.079
Optimal-4 0.61±0.015 0.37±0.007 0.62±0.036 0.35±0.003 0.55±0.022 0.27±0.015 0.14±0.001 2.90±0.051
TAG-4 0.63±0.004 0.38±0.008 0.63±0.030 0.36±0.013 0.58±0.025 0.28±0.008 0.16±0.007 3.03±0.053
Random-4 0.64±0.026 0.38±0.005 0.68±0.043 0.35±0.005 0.55±0.022 0.27±0.005 0.15±0.011 3.02±0.054
Ours-4 0.63±0.030 0.38±0.007 0.65±0.055 0.36±0.004 0.56±0.011 0.27±0.002 0.14±0.005 2.99±0.098
Random-5 0.65±0.032 0.37±0.005 0.65±0.028 0.36±0.007 0.57±0.019 0.27±0.005 0.14±0.012 3.02±0.063
Ours-5 0.63±0.030 0.38±0.007 0.65±0.055 0.36±0.004 0.56±0.011 0.27±0.002 0.14±0.005 2.99±0.098
Random-6 0.64±0.026 0.38±0.007 0.68±0.015 0.35±0.005 0.56±0.027 0.28±0.004 0.15±0.007 3.03±0.056
Ours-6 0.63±0.030 0.38±0.007 0.65±0.055 0.36±0.004 0.56±0.011 0.27±0.002 0.14±0.005 2.99±0.098
Ours-7 0.63±0.030 0.38±0.007 0.65±0.055 0.36±0.004 0.56±0.011 0.27±0.002 0.14±0.005 2.99±0.098
Figure 4: Demonstration of the pairwise transfer gains collected by our method on (a) CelebA, (b) COP, and (c) ETTm1, where each row represents the gain from the corresponding task to other tasks.

C.3 Visualized Transfer Gains

The pairwise transfer gains collected by our method are visualized in Figure 4. The heatmaps reveal that nearly all tasks achieve the highest transfer gain with themselves, while exhibiting varying transfer gain distributions across other tasks. This variation underscores an opportunity to improve overall performance through task grouping. Moreover, certain subsets within each heatmap show consistently high transfer gains among their tasks, suggesting that these tasks naturally form an effective group. It is also noteworthy that the heatmap for ETTm1 (Figure 4(c)) is less structured than those for the other two datasets. This lack of structure makes effective groupings harder to generate and thus presents a more difficult scenario for task grouping.
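The two observations about the heatmaps (a dominant diagonal and subsets with mutually high gains) can be made concrete on a small matrix. The sketch below uses made-up values, not the gains from Figure 4, and an exhaustive search over subsets; it is a simplification for illustration and is not the mathematical programming formulation used by our method. The helper names `mean_mutual_gain` and `best_group` are hypothetical.

```python
from itertools import combinations

# Hypothetical gain matrix (illustrative values, NOT taken from the paper):
# gains[i][j] is the transfer gain from task i to task j.
gains = [
    [1.0, 0.8, 0.7, 0.1],
    [0.9, 1.0, 0.6, 0.0],
    [0.7, 0.8, 1.0, 0.2],
    [0.1, 0.2, 0.1, 1.0],
]

def mean_mutual_gain(group, gains):
    """Average off-diagonal gain over all ordered task pairs inside a group."""
    pairs = [(i, j) for i in group for j in group if i != j]
    return sum(gains[i][j] for i, j in pairs) / len(pairs)

def best_group(gains, size):
    """Exhaustively pick the subset of `size` tasks with the highest mean mutual gain."""
    n = len(gains)
    return max(combinations(range(n), size),
               key=lambda g: mean_mutual_gain(g, gains))

# Does every task transfer best to itself (dominant diagonal)?
self_is_best = all(row[i] == max(row) for i, row in enumerate(gains))

# Which three tasks form the most mutually beneficial group?
group = best_group(gains, 3)
score = mean_mutual_gain(group, gains)
```

Exhaustive search is only feasible for a handful of tasks, since the number of candidate subsets grows combinatorially with the task count.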

Appendix D Analysis of Collection Strategy on Transfer Gain

This section presents a comprehensive analysis of the effects of transfer gain collection frequency and the number of epochs required for our method on the final performance metrics.

Table 5 assesses the impact of the transfer gain collection frequency on relative improvement and speedup for two datasets, CelebA and ETTm1, with collection intervals ranging from 1 to 200 steps. On CelebA (Table 5(a)), relative improvement over random grouping increases as collection becomes less frequent, up to an interval of 50 steps, suggesting that less frequent collection potentially reduces the risk of over-estimation and thereby enhances task grouping performance. Beyond an interval of 50, however, performance diminishes due to inaccurate transfer gain estimation from insufficient data. A similar trend is observed on ETTm1 (Table 5(b)), where improvement peaks at an interval of 5 and declines gradually thereafter. This implies that a lazy collection strategy with intervals of 10 to 50 steps can achieve a speedup of roughly 7× to 25×, while maintaining performance and potentially even enhancing it.
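The lazy collection strategy discussed above amounts to measuring transfer gains only every few training steps and averaging what was measured. A minimal sketch follows, under loud assumptions: `collect_lazily` and `toy_gain` are hypothetical names, the gain is reduced to a single scalar rather than a task-by-task matrix, and the decaying signal is synthetic.

```python
import math

def collect_lazily(num_steps, interval, measure_gain):
    """Call `measure_gain(step)` only every `interval` steps and average the results."""
    total, count = 0.0, 0
    for step in range(num_steps):
        # ... the ordinary training update for this step would run here ...
        if step % interval == 0:
            total += measure_gain(step)
            count += 1
    return total / count, count

def toy_gain(step):
    # Synthetic stand-in: a gain signal that changes slowly over training.
    return math.exp(-step / 500.0)

dense_mean, dense_calls = collect_lazily(1000, 1, toy_gain)   # collect at every step
lazy_mean, lazy_calls = collect_lazily(1000, 50, toy_gain)    # collect every 50 steps
print(dense_calls, lazy_calls, round(abs(dense_mean - lazy_mean), 3))
```

When the signal varies slowly between consecutive steps, the sparse average stays close to the dense one while the number of gain measurements drops by the interval factor.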

(a) CelebA
Steps Relative Improvement (↑) Relative Speedup (↑)
1 3.50% 1.00
5 3.53% 4.25
10 3.94% 7.18
25 4.60% 12.16
50 5.13% 15.95
100 1.83% 15.95
200 3.06% 18.62
(b) ETTm1
Steps Relative Improvement (↑) Relative Speedup (↑)
1 5.76% 1.00
5 9.48% 4.62
10 9.20% 8.45
25 8.75% 16.64
50 8.75% 24.94
100 6.70% 33.24
200 5.87% 39.16
Table 5: Variation in total test errors across different frequencies of transfer gain collection, compared to test errors from random groupings, with speedup evaluated based on the computation of transfer gains at each step. The 'Steps' column gives the interval between successive transfer gain collection steps.

Figure 5 shows the performance trends across different numbers of epochs for the CelebA and ETTm1 datasets. For CelebA, the total error decreases markedly as the number of epochs increases and stabilizes after approximately 20 epochs. This stabilization suggests that the grouping quality benefits quickly from the initial training iterations and then reaches a plateau, with little to no gain from additional training beyond this point. For ETTm1, the total Mean Absolute Error (MAE) also drops steeply at first, indicating significant gains from early training, and the curve flattens after about 6 epochs, which implies that further training yields diminishing improvements.

Figure 5: Performance trends of the groups generated by our method on (a) CelebA and (b) ETTm1, as a function of the number of epochs involved in the construction of transfer gains. Each point represents the mean total error at a specific epoch.

These findings demonstrate that transfer gains do not accumulate uniformly over training; some training phases contribute substantially more than others. Drawing on these insights, we derive two empirical guidelines for collecting transfer gains efficiently: (1) The initial training period is crucial for uncovering task relationships, so resources should be concentrated on the early stages of training; (2) A collection interval of 5 to 50 steps works well in our experiments, as the gains from consecutive steps are likely to be similar.