Panacea: Pareto Alignment via Preference Adaptation for LLMs

Yifan Zhong1,2  ,  Chengdong Ma1∗, Xiaoyuan Zhang3∗, Ziran Yang4, Haojun Chen1
Qingfu Zhang3, Siyuan Qi2, Yaodong Yang1
Equal contribution. 1Institute for Artificial Intelligence, Peking University. 2National Key Laboratory of General Artificial Intelligence, BIGAI. 3Department of Computer Science, City University of Hong Kong. 4Yuanpei College, Peking University. Correspondence to: Yaodong Yang <[email protected]>
Abstract

Current methods for large language model alignment typically use scalar human preference labels. However, this convention tends to oversimplify the multi-dimensional and heterogeneous nature of human preferences, leading to reduced expressivity and even misalignment. This paper presents Panacea, an innovative approach that reframes alignment as a multi-dimensional preference optimization problem. Panacea trains a single model capable of adapting online and Pareto-optimally to diverse sets of preferences without the need for further tuning. A major challenge here is using a low-dimensional preference vector to guide the model’s behavior, despite it being governed by an overwhelmingly large number of parameters. To address this, Panacea is designed to use singular value decomposition (SVD)-based low-rank adaptation, which allows the preference vector to be simply injected online as singular values. Theoretically, we prove that Panacea recovers the entire Pareto front with common loss aggregation methods under mild conditions. Moreover, our experiments demonstrate, for the first time, the feasibility of aligning a single LLM to represent an exponentially vast spectrum of human preferences through various optimization methods. Our work marks a step forward in effectively and efficiently aligning models to diverse and intricate human preferences in a controllable and Pareto-optimal manner.

1 Introduction

AI alignment aims to ensure AI systems align with human intentions, and there has been notable progress in this area, especially for large language models (LLMs) [28, 11, 29, 1]. The prevailing approach for LLM alignment involves curating a dataset {(x,y1,y2,z)}𝑥subscript𝑦1subscript𝑦2𝑧\{(x,y_{1},y_{2},z)\}{ ( italic_x , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_z ) }, where each prompt x𝑥xitalic_x is associated with a pair of responses (y1,y2)subscript𝑦1subscript𝑦2(y_{1},y_{2})( italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) and a scalar label z{0,1}𝑧01z\in\{0,1\}italic_z ∈ { 0 , 1 } that indicates if y1subscript𝑦1y_{1}italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is a “better” response. These labels are typically generated based on detailed guidelines that encompass various criteria, reflecting multiple dimensions i{1,,m}𝑖1𝑚i\in\{1,\cdots,m\}italic_i ∈ { 1 , ⋯ , italic_m } of human preferences (e.g., helpfulness, harmlessness, conciseness, humor, formality). Pre-trained models are subsequently further optimized on this dataset using methods including reinforcement learning, supervised learning, or game-theoretical approaches [26, 41, 31, 5, 43, 3, 46, 39]. However, this single-objective alignment methodology may not fully capture the complexity of real-world scenarios for two reasons (Figure 1).

First, this method can lead to inconsistency and ambiguity in data labels. Human labelers assign scalar labels z𝑧zitalic_z by implicitly evaluating responses across every dimension i𝑖iitalic_i with different preference weights to i𝑖iitalic_i, and reaching a final judgment. These differences often result in conflicting labels, causing misalignment or learning failures (Appendix B), substantiated by the low average label agreement reported in [4]. Second, optimizing a single objective leads to only one model that attempts to fit the potentially conflicting labeling preferences, i.e., the helpfulness-harmlessness dilemma. This single model may not cover the full spectrum of human preferences across all dimensions, thereby exacerbating biases against underrepresented groups and failing to meet diverse user needs.

Refer to caption
Figure 1: Comparison of the predominant single-objective alignment and our multi-dimensional alignment. For the two responses to a prompt, labelers agree on the preferable one in each preference dimension, but conflict when assigning a synthesized scalar label denoting which is “better”. This arises due to the inherently different preference weights held by labelers, a common case in reality. Performing single-objective optimization on the potentially conflicting scalar-label dataset (left) could lead to a dominated solution and misalignment. By contrast, our method, Panacea, leverages multi-dimensional preference optimization (right) on the consistent multi-dimensional dataset and learns the entire Pareto front (PF), thereby aligning with diverse and complex human preferences.

To address these challenges, we formulate the alignment as a multi-dimensional preference optimization (MDPO) problem. By explicitly curating data for each dimension, we enhance data consistency and simplify the labeling process, thereby overcoming the first limitation.

Upon the obtained dataset, our goal is to concurrently optimize across all dimensions. However, this is often infeasible due to potential conflicts among preferences (e.g., helpfulness vs. harmlessness in response to hazardous user requests). Therefore, we aim for Pareto-optimality [38], which means finding solutions where no preference dimension can be made better off without making another worse off. However, many Pareto-optimal solutions might exist. Instead of just learning one such solution, we focus on learning the entire set of Pareto-optimal solutions. To achieve this, we use a single model capable of recovering any Pareto-optimal solution by inputting the appropriate preference vector.

In this paper, we propose Panacea (Pareto alignment via preference adaptation), a simple yet effective method that: 1) learns the entire Pareto-optimal solution set for all possible preferences with a single model, and 2) infers Pareto-optimal responses online by simply injecting any preference vector into the model. Our method, providing a comprehensive representation of human preferences, effectively caters to diverse user needs, thus mitigating the second limitation (Figure 1).

A key challenge lies in how to utilize a low-dimensional preference vector to control the model’s behavior. Our core insight is that, similar to the crucial role of the preference vector in sha** the Pareto solution, singular values are pivotal in defining the model’s fundamental behavior in a singular value decomposition (SVD)-based low-rank adaptation (LoRA)[21, 56]. To address the above challenge, we incorporate the preference vector into the singular values within each SVD-LoRA layer. We then scale it using a learnable factor to align with the magnitude of other singular values. The model is trained end-to-end using a joint objective function aggregated according to the preference vector. The flexibility of Panacea enables seamless compatibility with various preference optimization procedures, e.g., supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) [41], and direct preference optimization (DPO) [43], and diverse methods for loss aggregation, e.g., linear scalarization (LS) [9][Section 4.7.5] and weighted Tchebycheff (Tche) [38][Section 3.4]. Through theoretical analysis, we confirm that Panacea can effectively capture the entire Pareto front (PF) under practical conditions. This finding provides a solid rationale for training a single Pareto set model to learn all Pareto optimal solutions across the entire preference space.

In our experiments, we assess the effectiveness and scalability of Panacea on several significant and challenging preference alignment problems with up to 10 dimensions, where the Pareto set cardinality grows exponentially with the number of dimensions, considerably surpassing the scope of current research. Panacea consistently outperforms baseline methods, producing superior, uniformly distributed, and convex fronts in accordance with the theory. Quantitative metrics highlight its substantial advantages, demonstrating an order-of-magnitude improvement. Notably, Panacea exhibits no performance saturation even on the ten-dimensional problem, indicating its extensive potential. For the first time, we show the possibility of aligning a single model with exponentially many heterogeneous preferences, opening up a promising avenue for LLM alignment.

This paper makes three main contributions. First, we identify the fundamental limitations of the predominant scalar-label, single-objective alignment paradigm, and propose to reframe alignment as a multi-dimensional preference optimization problem. Second, we design Panacea, a simple yet effective method that learns one single model that can online and Pareto-optimally adapt to any set of preferences, without the need for further tuning. Third, we provide theoretical supports and empirical validations to demonstrate the Pareto optimality, scalability, efficiency, and simplicity of Panacea, thereby satisfying the urgent need for Pareto alignment to diverse human preferences.

2 Related Work

Pareto Set Learning. Different from previous classical multi-objective optimization (MOO) methods [58, 34, 37, 55] that use a finite set of solutions (referred to as “particles") to approximate the entire Pareto set, Pareto set learning (PSL) [40, 35, 57] aims to use a single model to recover the complete Pareto set/front. The advantage of PSL is that it can store an infinite number of Pareto solutions within a model. This allows users to specify their own preferences, and the model can dynamically output a particular Pareto solution in real-time according to those preferences. Typical applications of PSL includes multiobjective industrial design problems [57, 36], reinforcement learning [7, 53, 23], text-to-image generalization [32], and drug design [24, 60]. While there have been some studies on PSL involving deep neural networks, these models are considerably smaller compared to LLMs. Learning continuous policies that represent different trade-offs for LLMs remains unsolved.

Multi-Dimensional Preference Optimization. Existing research primarily treats AI alignment as a single-objective optimization problem with scalar labels [41, 54, 16, 43, 39, 46], often neglecting the complexity of diverse human preferences. Panacea provides an in-depth analysis of this limitation in Appendix B, which is subsequently substantiated by MaxMin-RLHF’s result of “impossibility of alignment” [12] after Panacea first came out. To address this crucial gap, one recent attempt is AlignDiff [18], which trains an attribute-conditioned diffusion model to conduct preference alignment planning in the RL settings. In the realm of LLMs, there are some contemporary works on this topic [59, 25, 17, 20, 50, 51, 52], where the most relevant one Rewarded Soups (RS) [44] adopts a multi-policy strategy. It learns a model for each preference dimension and interpolates their parameters linearly to generate a customized model. However, its simple design also constitutes its drawback. Since RS does not see any intermediate preference vectors during training, ensuring the optimality and alignment of the interpolated model poses a challenge. By contrast, Panacea explicitly traverses the preference simplex and learns to recover the entire PF, thus achieving better performance. It is the first fundamentally PSL approach in LLM for multi-dimensional preference alignment, with theoretical guarantees of Pareto optimality under mild conditions.

Refer to caption
Figure 2: Panacea embeds the preference vector into singular values of each SVD-LoRA layer and scales it with learnable factors to match the magnitudes. During learning, for each data batch, we randomly sample a preference vector from the preference simplex and train the embedded model with various optimization procedures and loss aggregation methods. In the inference stage, the model adapts online to the user-specified preference vector and exhibits Pareto alignment in its responses.

3 Problem Formulation

Human preference is inherently multi-dimensional. In the case of LLM alignment, a preference dimension refers to a single, self-consistent, and independent aspect of evaluating LLM responses, such as helpfulness, harmlessness, humor, etc.. We formulate the multi-dimensional preference optimization (MDPO) problem with m𝑚mitalic_m dimensions as:

maxθΘ𝑱(πθ)=(J1(πθ),J2(πθ),,Jm(πθ)),subscript𝜃Θ𝑱subscript𝜋𝜃subscript𝐽1subscript𝜋𝜃subscript𝐽2subscript𝜋𝜃subscript𝐽𝑚subscript𝜋𝜃\small\max_{\theta\in\Theta}{\bm{J}}(\pi_{\theta})=(J_{1}(\pi_{\theta}),J_{2}(% \pi_{\theta}),\ldots,J_{m}(\pi_{\theta})),roman_max start_POSTSUBSCRIPT italic_θ ∈ roman_Θ end_POSTSUBSCRIPT bold_italic_J ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) = ( italic_J start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) , italic_J start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) , … , italic_J start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) ) , (1)

where πθΠsubscript𝜋𝜃Π\pi_{\theta}\in\Piitalic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∈ roman_Π is a policy, i.e. an LLM, and θ𝜃\thetaitalic_θ is its trainable parameters (decision variable), ΠΠ\Piroman_Π is the policy space, ΘΘ\Thetaroman_Θ is the parameter space, and Ji,i=1,,mformulae-sequencesubscript𝐽𝑖𝑖1𝑚J_{i},i=1,\cdots,mitalic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , ⋯ , italic_m denotes a performance measure of dimension i𝑖iitalic_i, such as SFT objective JSFT,i(πθ)subscript𝐽SFT𝑖subscript𝜋𝜃J_{\text{SFT},i}(\pi_{\theta})italic_J start_POSTSUBSCRIPT SFT , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ), RLHF objective JRLHF,i(πθ)subscript𝐽RLHF𝑖subscript𝜋𝜃J_{\text{RLHF},i}(\pi_{\theta})italic_J start_POSTSUBSCRIPT RLHF , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ), and DPO objective JDPO,i(πθ)subscript𝐽DPO𝑖subscript𝜋𝜃J_{\text{DPO},i}(\pi_{\theta})italic_J start_POSTSUBSCRIPT DPO , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) detailed in the following equations,

JSFT,i(πθ)=subscript𝐽SFT𝑖subscript𝜋𝜃absent\displaystyle\small J_{\text{SFT},i}(\pi_{\theta})=italic_J start_POSTSUBSCRIPT SFT , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) = 𝔼(x,y)𝒟i[logπθ(y|x)],subscript𝔼similar-to𝑥𝑦subscript𝒟𝑖delimited-[]subscript𝜋𝜃conditional𝑦𝑥\displaystyle\ \mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}\left[\log\pi_{\theta}(y|x% )\right],blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y | italic_x ) ] , (2)
JRLHF,i(πθ)=subscript𝐽RLHF𝑖subscript𝜋𝜃absent\displaystyle J_{\text{RLHF},i}(\pi_{\theta})=italic_J start_POSTSUBSCRIPT RLHF , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) = 𝔼x𝒟[𝔼yπθ(|x)[ri(x,y)]β𝔻KL[πθ(|x)||πref(|x)]],\displaystyle\ \mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi_{\theta% }(\cdot|x)}\left[r_{i}(x,y)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{% \theta}(\cdot|x)||\pi_{\text{ref}}(\cdot|x)\right]\right],blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ] , (3)
JDPO,i(πθ)=subscript𝐽DPO𝑖subscript𝜋𝜃absent\displaystyle J_{\text{DPO},i}(\pi_{\theta})=italic_J start_POSTSUBSCRIPT DPO , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) = 𝔼(x,yw,yl)𝒟i[logσ(βlogπθ(yw|x)πref(yw|x)βlogπθ(yl|x)πref(yl|x))].subscript𝔼similar-to𝑥subscript𝑦𝑤subscript𝑦𝑙subscript𝒟𝑖delimited-[]𝜎𝛽subscript𝜋𝜃conditionalsubscript𝑦𝑤𝑥subscript𝜋refconditionalsubscript𝑦𝑤𝑥𝛽subscript𝜋𝜃conditionalsubscript𝑦𝑙𝑥subscript𝜋refconditionalsubscript𝑦𝑙𝑥\displaystyle\ \mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}_{i}}\left[\log\sigma% \left(\beta\log\frac{\pi_{\theta}\left(y_{w}|x\right)}{\pi_{\mathrm{ref}}\left% (y_{w}|x\right)}-\beta\log\frac{\pi_{\theta}\left(y_{l}|x\right)}{\pi_{\mathrm% {ref}}\left(y_{l}|x\right)}\right)\right].blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_σ ( italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG ) ] . (4)

Notice that 𝒟i,risubscript𝒟𝑖subscript𝑟𝑖\mathcal{D}_{i},r_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represent the data and reward model for dimension i𝑖iitalic_i respectively. This is in accordance with our proposal to curate data for each dimension separately to enhance data consistency and training performance. Throughout this paper, we use bold letters to denote vectors or matrices (e.g. 𝑱,𝝀𝑱𝝀\bm{J},{\bm{\lambda}}bold_italic_J , bold_italic_λ). Very often, there does not exist a single solution θ𝜃\thetaitalic_θ that performs optimally on all dimensions due to their conflicts. Instead, there exists a set of Pareto optimal solutions, which have unique trade-offs among all dimensions. We say solution θ(a)superscript𝜃𝑎\theta^{(a)}italic_θ start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT dominates θ(b)superscript𝜃𝑏\theta^{(b)}italic_θ start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT, denoted as 𝑱(πθ(a))𝑱(πθ(b))succeeds𝑱subscript𝜋superscript𝜃𝑎𝑱subscript𝜋superscript𝜃𝑏\bm{J}(\pi_{\theta^{(a)}})\succ{\bm{J}}(\pi_{\theta^{(b)}})bold_italic_J ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ≻ bold_italic_J ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ), if for all i[m]𝑖delimited-[]𝑚i\in[m]italic_i ∈ [ italic_m ], Ji(πθ(a))Ji(πθ(b))subscript𝐽𝑖subscript𝜋superscript𝜃𝑎subscript𝐽𝑖subscript𝜋superscript𝜃𝑏J_{i}(\pi_{\theta^{(a)}})\geq J_{i}(\pi_{\theta^{(b)}})italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ≥ italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ), and there exists at least one index j[m]𝑗delimited-[]𝑚j\in[m]italic_j ∈ [ italic_m ] such that Jj(πθ(a))>Jj(πθ(b))subscript𝐽𝑗subscript𝜋superscript𝜃𝑎subscript𝐽𝑗subscript𝜋superscript𝜃𝑏J_{j}(\pi_{\theta^{(a)}})>J_{j}(\pi_{\theta^{(b)}})italic_J start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) > italic_J start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) [19, 38]. Based on this, Pareto optimality is defined as:

Definition 3.1 (Pareto optimality).

We call a solution θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT Pareto optimal if no other solution θΘsuperscript𝜃Θ\theta^{\prime}\in\Thetaitalic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_Θ dominates θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. The set of all Pareto optimal solutions is called the Pareto set (PS); while its image set in the objective space is called the Pareto front (PF), 𝒯𝒯{\mathcal{T}}caligraphic_T. A solution θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is considered weakly Pareto optimal if no other solution θsuperscript𝜃\theta^{\prime}italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT can strictly dominate it, that is, if Ji(πθ)>Ji(πθ)subscript𝐽𝑖subscript𝜋superscript𝜃subscript𝐽𝑖subscript𝜋superscript𝜃J_{i}(\pi_{\theta^{\prime}})>J_{i}(\pi_{\theta^{*}})italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) > italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) for all i[m]𝑖delimited-[]𝑚i\in[m]italic_i ∈ [ italic_m ].

Human’s trade-offs among all dimensions are quantified as a preference vector, 𝝀=(λ1,,λm)𝝀subscript𝜆1subscript𝜆𝑚\bm{\lambda}=(\lambda_{1},\ldots,\lambda_{m})bold_italic_λ = ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ), where 𝝀Δm𝝀subscriptΔ𝑚\bm{\lambda}\in\Delta_{m}bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, λi0subscript𝜆𝑖0\lambda_{i}\geq 0italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≥ 0, and i=1mλi=1superscriptsubscript𝑖1𝑚subscript𝜆𝑖1\sum_{i=1}^{m}\lambda_{i}=1∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1. Here, λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the weight for preference dimension i𝑖iitalic_i (called preference weight), and ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is the preference simplex. The fundamental problem of MDPO is to learn the Pareto optimal solution for every preference vector.

4 Panacea: Pareto Alignment via Preference Adaptation

To solve the MDPO problem, our goal is to learn a single model capable of representing the entire Pareto-optimal solution set. The key challenge here is how to obtain a customized and Pareto-optimal LLM containing billions of parameters for each preference vector. Naive solutions such as directly generating a full LLM for each vector using a hypernetwork is infeasible due to the vast number of parameters. To avoid this, we consider LoRA [21], a parameter-efficient fine-tuning method, which, for each layer, freezes the original weights 𝑾0subscript𝑾0{\bm{W}}_{0}bold_italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and only learns pairs of rank decomposition matrices 𝑨,𝑩𝑨𝑩{\bm{A}},{\bm{B}}bold_italic_A , bold_italic_B for adaptation. According to LoRA, the final weight 𝑾𝑾{\bm{W}}bold_italic_W is obtained by 𝑾=𝑾0+𝑩𝑨𝑾subscript𝑾0𝑩𝑨{\bm{W}}={\bm{W}}_{0}+{\bm{B}}{\bm{A}}bold_italic_W = bold_italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_B bold_italic_A. However, a rank-8 LoRA of Alpaca-7B [47] still contains nearly 20 million parameters, which means producing separate LoRA parameters for each preference vector can also significantly suffer from training difficulty and instability issues. We thus explore an alternative approach inspired by AdaLoRA [56]. This method employs singular value decomposition (SVD)-based LoRA and learns the left singular matrix 𝑼𝑼{\bm{U}}bold_italic_U, diagonal matrix 𝚺𝚺{\bm{\Sigma}}bold_Σ (representing singular values), and right singular matrix 𝑽𝑽{\bm{V}}bold_italic_V. Moreover, 𝑼𝑼{\bm{U}}bold_italic_U and 𝑽𝑽{\bm{V}}bold_italic_V are subject to orthogonality regularization.

𝑾=𝑾0+𝑼𝚺𝑽,𝑾subscript𝑾0𝑼𝚺superscript𝑽top{\bm{W}}={\bm{W}}_{0}+{\bm{U}}{\bm{\Sigma}}{\bm{V}}^{\top},bold_italic_W = bold_italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_U bold_Σ bold_italic_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , (5)

which hereafter we call SVD-LoRA. By extracting singular values 𝚺𝚺{\bm{\Sigma}}bold_Σ of incremental matrices, SVD-LoRA captures the core features of adaptation in a few parameters. More importantly, the singular values provide an interface to fundamentally influence model behavior.

Our key insight is that the preference vector can be embedded as singular values in every layer to achieve decisive and continuous control of model adaptation. Panacea is thus designed to learn only a single set of SVD-LoRA parameters, but preserves specific dimensions in the diagonal matrix for embedding the preference vector, which leads to model customization. Concretely, for layer l𝑙litalic_l, we preserve k𝑘kitalic_k singular values for learning general and preference-agnostic features and concatenate them with the m𝑚mitalic_m dimensional preference vector 𝝀𝝀{\bm{\lambda}}bold_italic_λ multiplied by a per-weight-matrix learnable scaling factor slsuperscript𝑠𝑙s^{l}italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT. Therefore, for each weight matrix 𝑾ln1l×n2lsuperscript𝑾𝑙superscriptsubscriptsuperscript𝑛𝑙1subscriptsuperscript𝑛𝑙2{\bm{W}}^{l}\in\mathbb{R}^{n^{l}_{1}\times n^{l}_{2}}bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, we have 𝑾0ln1l×n2lsuperscriptsubscript𝑾0𝑙superscriptsubscriptsuperscript𝑛𝑙1subscriptsuperscript𝑛𝑙2{\bm{W}}_{0}^{l}\in\mathbb{R}^{n^{l}_{1}\times n^{l}_{2}}bold_italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, left singular matrix 𝑼l=[𝒖1l,,𝒖kl,𝒖k+1l,,𝒖k+ml]n1l×(k+m)superscript𝑼𝑙subscriptsuperscript𝒖𝑙1subscriptsuperscript𝒖𝑙𝑘subscriptsuperscript𝒖𝑙𝑘1subscriptsuperscript𝒖𝑙𝑘𝑚superscriptsubscriptsuperscript𝑛𝑙1𝑘𝑚{\bm{U}}^{l}=[{\bm{u}}^{l}_{1},\ldots,{\bm{u}}^{l}_{k},{\bm{u}}^{l}_{k+1},% \ldots,{\bm{u}}^{l}_{k+m}]\in\mathbb{R}^{n^{l}_{1}\times(k+m)}bold_italic_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = [ bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT , … , bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + italic_m end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × ( italic_k + italic_m ) end_POSTSUPERSCRIPT, diagonal matrix 𝚺l=diag(σ1l,,σkl,slλ1,,slλm)(k+m)×(k+m)superscript𝚺𝑙diagsubscriptsuperscript𝜎𝑙1subscriptsuperscript𝜎𝑙𝑘superscript𝑠𝑙subscript𝜆1superscript𝑠𝑙subscript𝜆𝑚superscript𝑘𝑚𝑘𝑚{\bm{\Sigma}}^{l}=\text{diag}(\sigma^{l}_{1},\ldots,\sigma^{l}_{k},s^{l}% \lambda_{1},\ldots,s^{l}\lambda_{m})\in\mathbb{R}^{(k+m)\times(k+m)}bold_Σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = diag ( italic_σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_k + italic_m ) × ( italic_k + italic_m ) end_POSTSUPERSCRIPT, and right singular matrix 𝑽l=[𝒗1l,,𝒗kl,𝒗k+1l,,𝒗k+ml]n2l×(k+m)superscript𝑽𝑙subscriptsuperscript𝒗𝑙1subscriptsuperscript𝒗𝑙𝑘subscriptsuperscript𝒗𝑙𝑘1subscriptsuperscript𝒗𝑙𝑘𝑚superscriptsubscriptsuperscript𝑛𝑙2𝑘𝑚{\bm{V}}^{l}=[{\bm{v}}^{l}_{1},\ldots,{\bm{v}}^{l}_{k},{\bm{v}}^{l}_{k+1},% \ldots,{\bm{v}}^{l}_{k+m}]\in\mathbb{R}^{n^{l}_{2}\times(k+m)}bold_italic_V start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = [ bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT , … , bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + italic_m end_POSTSUBSCRIPT ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT × ( italic_k + italic_m ) end_POSTSUPERSCRIPT. The scaling factor is important since we observe that the preference-agnostic singular values commonly range from 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT in our experiment scenarios, which could be significantly smaller than preference weights, and their magnitudes differ across weight matrices, so both no scaling and a unified scaling are suboptimal. Concerning our design, one may worry whether m𝑚mitalic_m, the dimension of preference vector, is negligible compared to k𝑘kitalic_k. Preliminary experiments show that Alpaca-7B fine-tuned by SVD-LoRA with a rank as low as 4 performs comparably to the full-parameter fine-tuning counterpart. Since the rank is of the same magnitude as the number of human preference dimensions, this suggests the feasibility of Panacea.

During each training iteration, we randomly sample a preference vector from the preference simplex ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, embed it into all weight matrices, and obtain the preference embedded model πθ,𝝀subscript𝜋𝜃𝝀\pi_{\theta,{\bm{\lambda}}}italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT. We then compute an aggregated objective function of πθ,𝝀subscript𝜋𝜃𝝀\pi_{\theta,{\bm{\lambda}}}italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT across all preference dimensions according to 𝝀𝝀{\bm{\lambda}}bold_italic_λ, by synthesizing per-dimension objective functions with loss aggregation methods. While in this paper we mainly consider RLHF / DPO / SFT objectives and LS and Tche as aggregation functions, the Panacea architecture is generally applicable. The LS function [9][Section 4.7.5] is given by

maxθg𝝀LS(θ)=maxθi=1mλiJi(πθ),subscript𝜃subscriptsuperscript𝑔LS𝝀𝜃subscript𝜃superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝐽𝑖subscript𝜋𝜃\max_{\theta}g^{\mathrm{LS}}_{\bm{\lambda}}(\theta)=\max_{\theta}\sum\nolimits% _{i=1}^{m}\lambda_{i}J_{i}(\pi_{\theta}),roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) , (6)

and the Tche function is defined as,

maxθg𝝀Tche(θ)=maxθmin1imλi(Ji(πθ)zi),subscript𝜃subscriptsuperscript𝑔Tche𝝀𝜃subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝐽𝑖subscript𝜋𝜃subscript𝑧𝑖\max_{\theta}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)=\max_{\theta}\min_{1\leq i% \leq m}\lambda_{i}(J_{i}(\pi_{\theta})-z_{i}),roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) - italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (7)

where 𝒛𝒛{\bm{z}}bold_italic_z is an ideal vector such that ziJi(πθ),θΘ,i[m]formulae-sequencesubscript𝑧𝑖subscript𝐽𝑖subscript𝜋𝜃formulae-sequencefor-all𝜃Θfor-all𝑖delimited-[]𝑚z_{i}\geq J_{i}(\pi_{\theta}),\forall\theta\in\Theta,\forall i\in[m]italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≥ italic_J start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) , ∀ italic_θ ∈ roman_Θ , ∀ italic_i ∈ [ italic_m ]. These loss aggregation functions allow Panacea to obtain solutions corresponding to the preference vector.

With respect to the aggregated objective, trainable parameters for each weight matrix 𝑾lsuperscript𝑾𝑙{\bm{W}}^{l}bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, including 𝑼lsuperscript𝑼𝑙{\bm{U}}^{l}bold_italic_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, 𝑽lsuperscript𝑽𝑙{\bm{V}}^{l}bold_italic_V start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, (σ1l,,σkl)subscriptsuperscript𝜎𝑙1subscriptsuperscript𝜎𝑙𝑘(\sigma^{l}_{1},\ldots,\sigma^{l}_{k})( italic_σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), slsuperscript𝑠𝑙s^{l}italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, are then updated via gradient descent. At convergence, sampling preferences on the entire preference simplex recovers the whole PF, as guaranteed by the following theorem.

Theorem 4.1.

Panacea recovers the entire Pareto front for both LS and Tche aggregation functions (Equations 6 and 7) under the following assumptions: 1. Panacea with SVD-LoRA has sufficient representation capability for all preferences 𝛌Δm𝛌subscriptΔ𝑚{\bm{\lambda}}\in\Delta_{m}bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Specifically, for any preference vector 𝛌𝛌{\bm{\lambda}}bold_italic_λ, the policy πθ,𝛌subscript𝜋𝜃𝛌\pi_{\theta,{\bm{\lambda}}}italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT can optimize the corresponding aggregation functions (Equations 6 and 7) to their maximum values. 2. For a specific preference vector 𝛌𝛌{\bm{\lambda}}bold_italic_λ, the LLM policy space formed by all πθ,𝛌subscript𝜋𝜃𝛌\pi_{\theta,{\bm{\lambda}}}italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT can represent all categorical output distributions of responses.
By optimizing the Panacea objective function 𝔼𝛌Δm[g𝛌agg(θ)]subscript𝔼𝛌subscriptΔ𝑚delimited-[]subscriptsuperscript𝑔agg𝛌𝜃\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}\left[g^{\mathrm{agg}}_{\bm{\lambda}}(% \theta)\right]blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_g start_POSTSUPERSCRIPT roman_agg end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) ], where g𝛌agg=g𝛌LS/g𝛌Tchesubscriptsuperscript𝑔agg𝛌subscriptsuperscript𝑔LS𝛌subscriptsuperscript𝑔Tche𝛌g^{\mathrm{agg}}_{\bm{\lambda}}=g^{\mathrm{LS}}_{\bm{\lambda}}/g^{\mathrm{Tche% }}_{\bm{\lambda}}italic_g start_POSTSUPERSCRIPT roman_agg end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT = italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT / italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT, the optimal policy found by Panacea can recover the entire Pareto front for almost every preference.

For proof, see Appendix C. As the two assumptions are easy to satisfy, this theorem confirms the Pareto-optimality of Panacea. Panacea also achieves fine-grained control of model behavior through preference embedding, making it a suitable solution to the MDPO problem. In the inference stage, the user can specify a preference vector and obtain the corresponding Pareto optimal model that aligns with his/her preference. We present a visual illustration of Panacea in Figure 2.

Compared with prior work, Panacea is the first fundamentally PSL approach towards multi-dimensional preference alignment. It only needs to learn and maintain one model to represent the PF, which is more computationally efficient than both the Discrete Policy Solutions (DPS) method [33, 6], which learns a model for every preference vector, and RS, which approximates the PF with m𝑚mitalic_m models optimized exclusively on the m𝑚mitalic_m preference dimensions. Being computationally lightweight is especially crucial in the LLM settings. Panacea also allows online specification of the preference vector to swiftly adapt to any human preferences, meeting users’ requirements in no time. Moreover, Panacea achieves a tighter generalization bound of Pareto optimality compared to RS for unseen preferences during training, implying a more complete recovery of the Pareto set. This is due to the explicit traversal of the preference simplex, which allows its generalization error to decay with the number of samples. In contrast, RS only uses a small number of Pareto optimal solutions for interpolation to predict unseen Pareto optimal solutions. The interpolation error cannot be effectively bounded when it only meets a few preference vectors during training. Finally, Panacea preserves explainability to some extent. For each weight matrix 𝑾lsuperscript𝑾𝑙{\bm{W}}^{l}bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, Panacea adapts it as

𝑾l=𝑾0l+𝑼l𝚺l𝑽l=𝑾0l+i=1kσil𝒖il𝒗il[1]+i=1mslλi𝒖k+il𝒗k+il[2].superscript𝑾𝑙subscriptsuperscript𝑾𝑙0superscript𝑼𝑙superscript𝚺𝑙superscriptsuperscript𝑽𝑙topsubscriptsuperscript𝑾𝑙0subscriptsubscriptsuperscript𝑘𝑖1subscriptsuperscript𝜎𝑙𝑖subscriptsuperscript𝒖𝑙𝑖superscriptsubscriptsuperscript𝒗𝑙𝑖topdelimited-[]1subscriptsubscriptsuperscript𝑚𝑖1superscript𝑠𝑙subscript𝜆𝑖subscriptsuperscript𝒖𝑙𝑘𝑖superscriptsubscriptsuperscript𝒗𝑙𝑘𝑖topdelimited-[]2\small{\bm{W}}^{l}={\bm{W}}^{l}_{0}+{\bm{U}}^{l}{\bm{\Sigma}}^{l}{{\bm{V}}^{l}% }^{\top}={\bm{W}}^{l}_{0}+\underbrace{\sum\nolimits^{k}_{i=1}\sigma^{l}_{i}{% \bm{u}}^{l}_{i}{{\bm{v}}^{l}_{i}}^{\top}}_{[1]}+\underbrace{\sum\nolimits^{m}_% {i=1}s^{l}\lambda_{i}{\bm{u}}^{l}_{k+i}{{\bm{v}}^{l}_{k+i}}^{\top}}_{[2]}.bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + bold_italic_U start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT bold_Σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT bold_italic_V start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT = bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + under⏟ start_ARG ∑ start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT italic_σ start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT [ 1 ] end_POSTSUBSCRIPT + under⏟ start_ARG ∑ start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_italic_u start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + italic_i end_POSTSUBSCRIPT bold_italic_v start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k + italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT [ 2 ] end_POSTSUBSCRIPT . (8)

Intuitively, term [1]delimited-[]1[1][ 1 ] captures shared features among preference dimensions, while term [2]delimited-[]2[2][ 2 ] learns dimension-specific adaptations and weights them by the preference vector to achieve Pareto alignment. The decoupling of learned parameters not only illustrates the mechanism of Panacea, but also leads to superior robustness of its preference adaptation strategy (further analyzed in Section E.5).

Table 1: This table compares algorithm performance using MOO metrics across all experiment evaluations. An upward arrow (\uparrow) means a larger value for this metric is better, whereas a downward arrow (\downarrow) indicates the opposite. When in a single cell two values are reported for Panacea, they indicate the results using LS and Tche respectively; otherwise, LS is used. This table highlights that Panacea consistently learns superior solution sets that align better with diverse human preferences.
Hypervolume \uparrow Inner product \uparrow Sparsity \downarrow Spacing \downarrow
Experiment Model Optim. RS Panacea RS Panacea RS Panacea RS Panacea
HH Llama1-ft RLHF 517.28517.28517.28517.28 915.04915.04\mathbf{915.04}bold_915.04 11.2611.2611.2611.26 14.2714.27\mathbf{14.27}bold_14.27 7392.917392.917392.917392.91 2758.592758.59\mathbf{2758.59}bold_2758.59 329.53329.53329.53329.53 207.19207.19\mathbf{207.19}bold_207.19
Llama1-ft DPO 0.3190.3190.3190.319 0.3220.322\mathbf{0.322}bold_0.322 / 0.3170.3170.3170.317 0.6320.6320.6320.632 0.6390.639\mathbf{0.639}bold_0.639 / 0.6370.6370.6370.637 0.480.480.480.48 0.30.3\mathbf{0.3}bold_0.3 / 0.950.950.950.95 2.882.882.882.88 2.512.51\mathbf{2.51}bold_2.51 / 3.253.253.253.25
Llama2-ft RLHF 519.38519.38519.38519.38 840.45840.45\mathbf{840.45}bold_840.45 8.598.598.598.59 14.6814.68\mathbf{14.68}bold_14.68 890.4890.4\mathbf{890.4}bold_890.4 5332.885332.885332.885332.88 90.3890.38\mathbf{90.38}bold_90.38 275.7275.7275.7275.7
Llama2-ft DPO 0.3180.3180.3180.318 0.3370.337\mathbf{0.337}bold_0.337 / 0.3340.3340.3340.334 0.6410.6410.6410.641 0.6530.653\mathbf{0.653}bold_0.653 / 0.6520.6520.6520.652 0.730.730.730.73 0.360.36\mathbf{0.36}bold_0.36 / 0.530.530.530.53 3.243.243.243.24 3.123.12\mathbf{3.12}bold_3.12 / 3.713.713.713.71
HHC Llama2-ft RLHF 13519135191351913519 𝟏𝟕𝟎𝟗𝟕17097\mathbf{17097}bold_17097 5.375.375.375.37 9.199.19\mathbf{9.19}bold_9.19 211.96211.96211.96211.96 48.4448.44\mathbf{48.44}bold_48.44 65.1565.15\mathbf{65.15}bold_65.15 65.7865.7865.7865.78
Llama2-ft DPO 0.1710.1710.1710.171 0.1770.177\mathbf{0.177}bold_0.177 0.640.640.640.64 0.650.65\mathbf{0.65}bold_0.65 0.10.10.10.1 0.060.06\mathbf{0.06}bold_0.06 1.981.98\mathbf{1.98}bold_1.98 2.452.452.452.45
Chat 3-dim Llama3-Instruct SFT 0.290.290.290.29 0.500.50\mathbf{0.50}bold_0.50 0.580.58-0.58- 0.58 0.420.42\mathbf{-0.42}- bold_0.42 0.680.680.680.68 0.040.04\mathbf{0.04}bold_0.04 6.376.376.376.37 2.132.13\mathbf{2.13}bold_2.13
Chat 4-dim Llama3-Instruct SFT 0.140.140.140.14 0.380.38\mathbf{0.38}bold_0.38 0.650.65-0.65- 0.65 0.430.43\mathbf{-0.43}- bold_0.43 0.250.250.250.25 0.020.02\mathbf{0.02}bold_0.02 5.065.065.065.06 2.172.17\mathbf{2.17}bold_2.17
Chat 5-dim Llama3-Instruct SFT 0.080.080.080.08 0.330.33\mathbf{0.33}bold_0.33 0.660.66-0.66- 0.66 0.420.42\mathbf{-0.42}- bold_0.42 0.140.140.140.14 0.020.02\mathbf{0.02}bold_0.02 4.914.914.914.91 2.282.28\mathbf{2.28}bold_2.28
Chat 10-dim Llama3-Instruct SFT 0.010.010.010.01 0.120.12\mathbf{0.12}bold_0.12 0.660.66-0.66- 0.66 0.470.47\mathbf{-0.47}- bold_0.47 0.030.030.030.03 0.010.01\mathbf{0.01}bold_0.01 3.943.943.943.94 2.192.19\mathbf{2.19}bold_2.19

5 Experiments

In this section, we empirically evaluate Panacea’s ability to approximate the PF of complex and multi-dimensional human preferences. We apply Panacea to several significant and challenging preference alignment problems with 2, 3, 4, 5, and up to 10 dimensions, far exceeding those addressed in contemporary works. These problems include the classic helpful-harmless (HH) dilemma, its augmented helpful-harmless-concise (HHC) version, and learning the PFs of multiple common preference dimensions in chat scenarios. While the number of dimensions m𝑚mitalic_m varies, we keep the preference-agnostic rank k𝑘kitalic_k of Panacea fixed to 8888 and observe Panacea’s performance. Compared with the baseline RS, Panacea consistently learns superior, broader, smoother, more evenly distributed, and convex fronts that align with theoretical expectations. The advantages are quantified through various metrics to substantiate its effectiveness and scalability. Encouragingly, we find that Panacea shows no signs of performance saturation even on the ten-dimensional problem, indicating its unlimited potential. We also conduct ablation studies to validate the design of Panacea. Full experimental details are elaborated in Appendix E, and chat cases are presented in Appendix F.

5.1 Mastering Dual Dimensions: Addressing the Helpful-Harmless Dilemma

Refer to caption
Figure 3: Algorithm performance on HH. Baseline methods (RS and DPS) require training a separate model for each preference dimension/vector, whereas Panacea learns a single adaptable model. Left: Panacea is significantly better than RS and even outperforms DPS, showing its superiority in learning PF while being more efficient. Middle: on Llama2-ft across different seeds, Panacea again consistently outperforms RS, and its fronts exhibit smooth convex shapes that correspond with theory. Right: with DPO, Panacea using both LS and Tche aggregation learns better fronts than RS.

In the first set of experiments, algorithms are tasked with two-dimensional preference alignment using various initial models, i.e. Alpaca-finetuned [47] Llama1-7B-base [48](abbv. Llama1-ft) and Llama2-7B-base [49] (abbv. Llama2-ft), optimization procedures, i.e. RLHF and DPO, and loss aggregation methods, i.e. LS and Tche. Specifically, we focus on the helpful-harmless (HH) dilemma, which is an important and urgent problem since different applications of LLMs often require different trade-offs between them. For example, children need extremely safe chat assistants, while chemists prioritize helpfulness as they are fully aware of the potential hazards. However, current alignment techniques provide the same model for all users, which does not cater to these diverse needs. Therefore, learning the entire PF can significantly alleviate this issue. We use the BeaverTails dataset [27], which has preference labels for both helpfulness and harmlessness.

Refer to caption
Figure 4: Responses of the model to the same user prompt with two extreme preference vectors. Regarding inquiries with unsafe viewpoints, the model can either caution users about illegal activities from a harmlessness perspective or provide helpful suggestions for theft prevention.

In Figure 3 left, we show the learned fronts of algorithms with the task configuration of Llama1-ft, RLHF, and LS aggregation. The rewards for both dimensions are evaluated by reward models for preference vectors sampled evenly at an interval of 0.10.10.10.1, i.e. 𝝀=(0.0,1.0),(0.1,0.9),,(1.0,0.0)𝝀0.01.00.10.91.00.0{\bm{\lambda}}=(0.0,1.0),(0.1,0.9),\ldots,(1.0,0.0)bold_italic_λ = ( 0.0 , 1.0 ) , ( 0.1 , 0.9 ) , … , ( 1.0 , 0.0 ). Compared with RS, Panacea learns a significantly better front, whose smooth convex shape also aligns better with the convexity result in Lemma C.3. In this experiment, we also test Discrete Policy Solutions (DPS) [33, 6], also known as multi-objective RLHF (MORL) in [44], which learns a separate model for each preference vector (11 models in this case) and is commonly considered as the performance upper bound for this problem. Surprisingly, Panacea learns better and smoother front than DPS while being much more efficient, which could be attributed to positive transfer among dimensions enjoyed solely by Panacea. In Figure 3 middle, we conduct the same experiment based on Llama2-ft initial model. Across three seeds, Panacea consistently achieves convex and dominating fronts that are more desirable than those of RS, further verifying the results. To clearly demonstrate how the model’s output changes with variations in the preference vector, we present an exemplar chat case in Figure 4 and its detailed version in Appendix F. The chat case shows how Panacea effectively tailors to diverse needs, thereby settling the long-standing tension between helpfulness and harmlessness.

To further study the generality of Panacea, we conduct experiments with Llama2-ft, DPO, and LS / Tche aggregation, where Panacea is optimized based on Appendix D and Appendix D respectively. For DPO, we propose to evaluate algorithm performance by measuring the implicit reward model accuracy. That is, for a model πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, it is accurate on a labeled pair (x,ywi,yli)𝑥subscriptsuperscript𝑦𝑖𝑤subscriptsuperscript𝑦𝑖𝑙(x,y^{i}_{w},y^{i}_{l})( italic_x , italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) if βlogπθ(ywi|x)πref(ywi|x)>βlogπθ(yli|x)πref(yli|x)𝛽subscript𝜋𝜃conditionalsuperscriptsubscript𝑦𝑤𝑖𝑥subscript𝜋refconditionalsubscriptsuperscript𝑦𝑖𝑤𝑥𝛽subscript𝜋𝜃conditionalsubscriptsuperscript𝑦𝑖𝑙𝑥subscript𝜋refconditionalsubscriptsuperscript𝑦𝑖𝑙𝑥\beta\log\frac{\pi_{\theta}\left(y_{w}^{i}|x\right)}{\pi_{\mathrm{ref}}\left(y% ^{i}_{w}|x\right)}>\beta\log\frac{\pi_{\theta}\left(y^{i}_{l}|x\right)}{\pi_{% \mathrm{ref}}\left(y^{i}_{l}|x\right)}italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG > italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG, and its total accuracy is obtained by averaging over dataset. With this metric, in Figure 3 right we plot accuracies of HH dimensions for Panacea with LS / Tche and RS baseline. Results again confirm that Panacea always obtains better fronts.

Aside from comparing the fronts learned by Panacea and the baseline, we also quantify the advantage of Panacea by computing four MOO metrics in Table 1. Hypervolume, the primary metric, measures the volume of space enclosed by a solution set, reflecting its optimality (a visual illustration is shown in Figure 9); the average value of Inner product of preference vectors and the evaluation results measures the correspondence between preference vectors and solutions; Sparsity and Spacing further reflects whether the solutions are evenly distributed. Mathematical expressions of these metrics are detailed in Section E.4. Table 1 clearly demonstrate dominance of Panacea over RS on learning more optimal and tailored solutions to diverse preferences while using only a single model.

5.2 Navigating Tri-Dimensional Trade-offs: Helpful, Harmless, and Concise Alignment

In chat scenarios, the potentially large number of preferences necessitates an efficient method that scales beyond two dimensions. Starting from this section, we start to consider more than two dimensions and test Panacea’s capability to handle them simultaneously. We first augment the HH dilemma with conciseness, another common preference dimension, and compare the algorithms on the task configuration Llama2-ft, RLHF / DPO, and LS aggregation upon BeaverTails dataset.

Refer to caption
Figure 5: Learned fronts of Panacea (red) and RS (blue) on HHC problem with Llama2-ft, RLHF, and LS aggregation. Panacea learns a better and more evenly distributed front while solutions of RS clutter in a corner. This suggests Panacea provides fine-grained solutions to diverse human preferences.

For RLHF, the concise RM is defined as a rectified affine function that assigns higher rewards to shorter responses; for DPO, the shorter response to each prompt is preferred in the conciseness dimension (details provided in Appendix E). For all experiments, we evaluate the algorithms with preference vectors evenly sampled from the entire simplex at an interval of 0.20.20.20.2, i.e. 𝝀=(0.0,0.0,1.0),(0.0,0.2,0.8),,(1.0,0.0,0.0)𝝀0.00.01.00.00.20.81.00.00.0{\bm{\lambda}}=(0.0,0.0,1.0),(0.0,0.2,0.8),\ldots,(1.0,0.0,0.0)bold_italic_λ = ( 0.0 , 0.0 , 1.0 ) , ( 0.0 , 0.2 , 0.8 ) , … , ( 1.0 , 0.0 , 0.0 ), and provide the results in Figure 5 and Table 1.

Figure 5 visualizes the fronts learned with RLHF procedure. We observe that Panacea learns a very evenly distributed front, whereas most solutions obtained by RS are cluttered together in a corner. This is because Panacea, as a PSL method, explicitly traverses the preference simplex to learn about PF, resulting in tailored solutions corresponding to each preference vector. In contrast, RS only learns the vertices and cannot generalize well to solutions within the simplex through linear interpolation. Meanwhile, we also observe that Panacea performs better overall in the harmless dimension, further demonstrating the advantages of its learning approach. MOO metrics in Table 1 again numerically depict the benefits of Panacea, and the chat case in Appendix F serves as qualitative support. Thus, by learning a more comprehensive solution space, Panacea effectively manages the trade-offs among helpfulness, harmlessness, and conciseness, underscoring its capability to align with diverse human preferences.

5.3 Scaling Up: Towards Tens-of-Dimensional Pareto Alignment with a Single Model

Refer to caption
Figure 6: Comparison of learned fronts on Chat 3-dim problem. On the left we show a 3D visualization of Panacea (red) and RS (blue) and on the right we show 2D projections by setting one of preference weights to zero. Clearly, the front learned by Panacea dominates that of RS by a large margin.

We further test Panacea’s scalability on three, four, five, and up to ten-dimensional alignment problems (abbv. Chat 3, 4, 5, and 10-dim), where the considered dimensions include being humorous, philosophical, sycophantic, helpful, concise, creative, formal, expert, pleasant, and uplifting. These dimensions reflect the common scenario where desired chat properties are not simultaneously attainable. Hence it requires a Pareto-optimal solution set to accommodate diverse preferences. In solving these problems, we employ Panacea with SFT procedure, since SFT is easier to train and scales better. The initial model used in this series of experiments is Llama-3-8B-Instruct [2] (abbv. Llama3-Instruct), and the loss aggregation function is LS. We first curate data for each dimension by prompting Llama3-Instruct to generate responses to Alpaca instructions with the corresponding property (details are provided in Appendix E). Panacea is then trained using LS aggregated SFT loss. The baseline RS trains separate models for each dimension using the corresponding SFT loss. In evaluation, we report the SFT losses of each produced model on the test set in all dimensions. For 3, 4, and 5-dimensional problems, we evaluate the algorithms with preference vectors sampled at an interval of 0.20.20.20.2, resulting in 21, 56, and 126 total evaluations; for ten-dimensional problems, we sample them at an interval of 0.250.250.250.25, amounting to 715 in total. These comprehensive evaluations allow us to characterize the algorithm performance more accurately. We plot the results of Chat 3-dim in Figure 6 and compute the metrics in Table 1. Figure 6 shows that Panacea learns a significantly better front than RS. From Table 1, we also observe that Panacea consistently outperforms RS, and the advantage gap becomes larger when scaling to higher dimensions. Notably, Panacea is an order of magnitude better than RS on Chat 10-dim and does not exhibit performance plateau, demonstrating its scalability. We provide a chat case in Appendix F from Chat 3-dim to show Panacea’s performance. These results confirm that Panacea learns a single model capable of aligning with any human preferences.

5.4 Ablation Study and Analysis

Refer to caption
Figure 7: Left: Ablation study on the learnable preference vector scaling factor. Predefined scaling factors ranging from 1111 to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT all result in significantly worse fronts than the learnable approach, indicating the importance of the per-weight-matrix learnable scaling factor. Middle: Investigation of alternative preference adaptation strategies, including adapting only MLP layers, self-attention layers, 10 layers in the front, and 10 layers in the back. Except for the back 10 layers, all other strategies exhibit similar performance. Thus, we decide to adapt all layers for better representation capacity. Right: We show the fronts learned by Panacea at different RLHF steps. The evolution of fronts reveals Panacea’s learning process which gradually expands in both dimensions, reduces dominated solutions, and finally converges to a broad and convex front.

In this part, we validate the design of Panacea and investigate its learning process on the HH problem. We first analyze the effect of the per-weight-matrix learnable scaling factor slsuperscript𝑠𝑙s^{l}italic_s start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT. Intuitively, it scales preference vectors to the same magnitude as the singular values to avoid either dominant or negligible influence of preference-specific features on 𝑾lsuperscript𝑾𝑙{\bm{W}}^{l}bold_italic_W start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT, as observed from the learned parameters. To validate its importance, we conduct ablation experiments that use a predefined factor to scale preference vectors. Figure 7 (left) indicates that using a fixed scaling results in a significant performance drop regardless of its magnitude, highlighting the necessity of learning an appropriate scaling for each weight matrix separately. We also explore alternative strategies of preference adaptation, which only adapt self-attention layers, MLP layers, the 10 layers in the front, and the 10 layers in the back. Figure 7 (middle) suggests that except for only adapting the back 10 layers, all other strategies perform comparably. Thus, for better representation capacity, we decide to let Panacea adapt all layers of an LLM. Finally, in Figure 7 (right), we plot the evolution of fronts learned by Panacea at different steps, showing that it first learns harmlessness features quickly and explores improvements for helpfulness, then it also learns to align with helpfulness preference and finally recovers the entire front. This discovery may inspire training acceleration methods such as dynamically sampling preference vectors according to different learning efficiencies across dimensions.

6 Conclusion

This paper presents Panacea, the first Pareto set learning approach towards solving Pareto alignment with multi-dimensional human preference using a single model. Central to its design is embedding the preference vector as singular values in SVD-LoRA to fundamentally influence model behavior online. Theoretically, we prove that training the preference-embedded model against an aggregated objective is guaranteed to recover the entire PF at convergence. Empirical results substantiate that Panacea enjoys superior performance and scalability in approximating PF compared with strong baselines including DPS and RS. Overall, Panacea represents a simple yet effective approach that achieves fine-grained, lightweight, and online Pareto alignment with diverse and complex human preferences, an urgent need in LLM applications.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • AI@Meta [2024] AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • Azar et al. [2023] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  • Bai et al. [2022a] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  • Bai et al. [2022b] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
  • Barrett and Narayanan [2008] Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. In Proceedings of the 25th international conference on Machine learning, pages 41–47, 2008.
  • Basaklar et al. [2022] Toygun Basaklar, Suat Gumussoy, and Umit Y Ogras. Pd-morl: Preference-driven multi-objective reinforcement learning algorithm. arXiv preprint arXiv:2208.07914, 2022.
  • Berry et al. [2011] Kenneth J Berry, Janis E Johnston, and Paul W Mielke Jr. Permutation methods. Wiley Interdisciplinary Reviews: Computational Statistics, 3(6):527–542, 2011.
  • Boyd and Vandenberghe [2004] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Casper et al. [2023] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  • Chakraborty et al. [2024] Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang. Maxmin-rlhf: Towards equitable alignment of large language models with diverse human preferences. arXiv preprint arXiv:2402.08925, 2024.
  • Choo and Atkins [1983] Eng Ung Choo and DR Atkins. Proper efficiency in nonconvex multicriteria programming. Mathematics of Operations Research, 8(3):467–470, 1983.
  • Dai et al. [2023] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023.
  • Deb et al. [2002] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE transactions on evolutionary computation, 6(2):182–197, 2002.
  • Dong et al. [2023a] Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023a.
  • Dong et al. [2023b] Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. arXiv preprint arXiv:2310.05344, 2023b.
  • Dong et al. [2023c] Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yu**g Hu, Tangjie Lv, Changjie Fan, and Zhipeng Hu. Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model. arXiv preprint arXiv:2310.02054, 2023c.
  • Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
  • Guo et al. [2024] Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. arXiv preprint arXiv:2402.19085, 2024.
  • Hu et al. [2022] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  • Hu et al. [2024] Yuzheng Hu, Ruicheng Xian, Qilong Wu, Qiuling Fan, Lang Yin, and Han Zhao. Revisiting scalarization in multi-task learning: A theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024.
  • Hwang et al. [2023] Minyoung Hwang, Luca Weihs, Chanwoo Park, Kimin Lee, Aniruddha Kembhavi, and Kiana Ehsani. Promptable behaviors: Personalizing multi-objective rewards from human preferences. arXiv preprint arXiv:2312.09337, 2023.
  • Jain et al. [2023] Moksh Jain, Sharath Chandra Raparthy, Alex Hernández-Garcıa, Jarrid Rector-Brooks, Yoshua Bengio, Santiago Miret, and Emmanuel Bengio. Multi-objective gflownets. In International Conference on Machine Learning, pages 14631–14653. PMLR, 2023.
  • Jang et al. [2023] Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Ye** Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023.
  • Jaques et al. [2019] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • Ji et al. [2023a] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023a.
  • Ji et al. [2023b] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852, 2023b.
  • Kaufmann et al. [2023] Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925, 2023.
  • Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  • Lee et al. [2023] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  • Lee et al. [2024] Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, et al. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation. arXiv preprint arXiv:2401.05675, 2024.
  • Li et al. [2020] Kaiwen Li, Tao Zhang, and Rui Wang. Deep reinforcement learning for multiobjective optimization. IEEE transactions on cybernetics, 51(6):3103–3114, 2020.
  • Lin et al. [2019] Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. Advances in neural information processing systems, 32, 2019.
  • Lin et al. [2020] Xi Lin, Zhiyuan Yang, Qingfu Zhang, and Sam Kwong. Controllable pareto multi-task learning. arXiv preprint arXiv:2010.06313, 2020.
  • Lin et al. [2022] Xi Lin, Zhiyuan Yang, Xiaoyuan Zhang, and Qingfu Zhang. Pareto set learning for expensive multi-objective optimization. Advances in Neural Information Processing Systems, 35:19231–19247, 2022.
  • Liu et al. [2021] Xingchao Liu, Xin Tong, and Qiang Liu. Profiling pareto front with multi-objective stein variational gradient descent. Advances in Neural Information Processing Systems, 34:14721–14733, 2021.
  • Miettinen [1999] Kaisa Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 1999.
  • Munos et al. [2023] Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
  • Navon et al. [2020] Aviv Navon, Aviv Shamsian, Gal Chechik, and Ethan Fetaya. Learning the pareto front with hypernetworks. arXiv preprint arXiv:2010.04104, 2020.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Peters and Schaal [2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007.
  • Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Rame et al. [2023] Alexandre Rame, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv preprint arXiv:2306.04488, 2023.
  • Roijers et al. [2015] Diederik Marijn Roijers, Shimon Whiteson, and Frans A Oliehoek. Computing convex coverage sets for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52:399–443, 2015.
  • Swamy et al. [2024] Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056, 2024.
  • Taori et al. [2023] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html, 3(6):7, 2023.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Wang et al. [2024] Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards. arXiv preprint arXiv:2402.18571, 2024.
  • Yang et al. [2024a] Kailai Yang, Zhiwei Liu, Qianqian Xie, Tianlin Zhang, Nirui Song, Jimin Huang, Ziyan Kuang, and Sophia Ananiadou. Metaaligner: Conditional weak-to-strong correction for generalizable multi-objective alignment of language models. arXiv preprint arXiv:2403.17141, 2024a.
  • Yang et al. [2024b] Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment. arXiv preprint arXiv:2402.10207, 2024b.
  • Yang et al. [2019] Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Advances in neural information processing systems, 32, 2019.
  • Yuan et al. [2023] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  • Zhang and Li [2007] Qingfu Zhang and Hui Li. Moea/d: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on evolutionary computation, 11(6):712–731, 2007.
  • Zhang et al. [2023a] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=lq62uWRJjiY.
  • Zhang et al. [2023b] Xiaoyuan Zhang, Xi Lin, Bo Xue, Yifan Chen, and Qingfu Zhang. Hypervolume maximization: A geometric view of pareto set learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  • Zhou et al. [2011] Aimin Zhou, Bo-Yang Qu, Hui Li, Shi-Zheng Zhao, Ponnuthurai Nagaratnam Suganthan, and Qingfu Zhang. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm and evolutionary computation, 1(1):32–49, 2011.
  • Zhou et al. [2023] Zhanhui Zhou, Jie Liu, Chao Yang, **g Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, and Yu Qiao. Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023.
  • Zhu et al. [2023] Yiheng Zhu, Jialu Wu, Chaowen Hu, Jiahuan Yan, Chang-Yu Hsieh, Tingjun Hou, and Jian Wu. Sample-efficient multi-objective molecular optimization with gflownets. arXiv preprint arXiv:2302.04040, 2023.
\doparttoc\faketableofcontents

Supplementary Material

\parttoc

Appendix A Preliminary Theoretical Results

In this section, we prove the validity of combining reward models of all preference dimensions through linear scalarization in the RLHF optimization procedure, even though each reward model solved by the Bradley-Terry (BT) model [10] is not uniquely determined. This is formalized in the following lemma.

Lemma A.1 (Extension of Lemma 2 in [43] for multiple reward models).

Let ri(x,y)subscript𝑟𝑖𝑥𝑦r_{i}(x,y)italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) and ri(x,y)superscriptsubscript𝑟𝑖𝑥𝑦r_{i}^{\prime}(x,y)italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) be equivalent reward models for the i𝑖iitalic_i-th preference dimension, where ri(x,y)=ri(x,y)+ϕi(x)superscriptsubscript𝑟𝑖𝑥𝑦subscript𝑟𝑖𝑥𝑦subscriptitalic-ϕ𝑖𝑥r_{i}^{\prime}(x,y)=r_{i}(x,y)+\phi_{i}(x)italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) = italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ). The linear combinations r(x,y)=i=1mλiri(x,y)𝑟𝑥𝑦superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝑟𝑖𝑥𝑦r(x,y)=\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)italic_r ( italic_x , italic_y ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) and r(x,y)=i=1mλiri(x,y)+i=1mλiϕi(x)superscript𝑟𝑥𝑦superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝑟𝑖𝑥𝑦superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscriptitalic-ϕ𝑖𝑥r^{\prime}(x,y)=\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)+\sum_{i=1}^{m}\lambda_{i}% \phi_{i}(x)italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) induce the same optimal policy in the constrained RL problem maxπJRLHF(π)=𝔼x𝒟[𝔼yπ(|x)[r(x,y)]β𝔻KL[π(|x)||πref(|x)]]\max_{\pi}J_{\text{RLHF}}(\pi)=\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{% y\sim\pi(\cdot|x)}\left[r(x,y)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi(% \cdot|x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT RLHF end_POSTSUBSCRIPT ( italic_π ) = blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ italic_r ( italic_x , italic_y ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ], where β𝛽\betaitalic_β is a positive punishment factor of the KL constraint.

Remark A.2.

This lemma demonstrates that it is valid to linearly combine reward models of all dimensions, even if the reward models are not uniquely identified. It is used in analyzing the limitations of single-objective alignment and it validates the LS aggregation employed with Panacea.

Below, we provide a concise proof of Lemma A.1.

Proof.

According to the constrained RL literatures [42, 8], the policy for the reward function r(x,y)superscript𝑟𝑥𝑦r^{\prime}(x,y)italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) in a Kullback-Leibler (KL) constrained reinforcement learning (RL) problem can be formulated as follows:

πr(y|x)=πref(y|x)exp(1βr(x,y))yπref(y|x)exp(1βr(x,y)).subscript𝜋superscript𝑟conditional𝑦𝑥subscript𝜋refconditional𝑦𝑥1𝛽superscript𝑟𝑥𝑦subscript𝑦subscript𝜋refconditional𝑦𝑥1𝛽superscript𝑟𝑥𝑦\pi_{r^{\prime}}(y|x)=\frac{\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r^{% \prime}(x,y)\right)}{\sum_{y}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r^{% \prime}(x,y)\right)}.italic_π start_POSTSUBSCRIPT italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y | italic_x ) = divide start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ) ) end_ARG .

Expanding the term in r(x,y)superscript𝑟𝑥𝑦r^{\prime}(x,y)italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x , italic_y ), we obtain:

πr(y|x)=πref(y|x)exp(1β(i=1mλiri(x,y)+i=1mλiϕi(x)ϕ(x)))yπref(y|x)exp(1β(i=1mλiri(x,y)+i=1mλiϕi(x)ϕ(x))).subscript𝜋superscript𝑟conditional𝑦𝑥subscript𝜋refconditional𝑦𝑥1𝛽superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝑟𝑖𝑥𝑦subscriptsuperscriptsubscript𝑖1𝑚subscript𝜆𝑖subscriptitalic-ϕ𝑖𝑥superscriptitalic-ϕ𝑥subscript𝑦subscript𝜋refconditional𝑦𝑥1𝛽superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝑟𝑖𝑥𝑦subscriptsuperscriptsubscript𝑖1𝑚subscript𝜆𝑖subscriptitalic-ϕ𝑖𝑥superscriptitalic-ϕ𝑥\pi_{r^{\prime}}(y|x)=\frac{\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}% \left(\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)+\underbrace{\sum_{i=1}^{m}\lambda_{i% }\phi_{i}(x)}_{\phi^{\prime}(x)}\right)\right)}{\sum_{y}\pi_{\text{ref}}(y|x)% \exp\left(\frac{1}{\beta}\left(\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)+\underbrace% {\sum_{i=1}^{m}\lambda_{i}\phi_{i}(x)}_{\phi^{\prime}(x)}\right)\right)}.italic_π start_POSTSUBSCRIPT italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y | italic_x ) = divide start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) end_ARG start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUBSCRIPT ) ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x ) end_ARG start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x ) end_POSTSUBSCRIPT ) ) end_ARG .

Upon simplifying by canceling out the common term exp(ϕ(x))superscriptitalic-ϕ𝑥\exp(\phi^{\prime}(x))roman_exp ( italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x ) ), we get:

πr(y|x)=πref(y|x)exp(1βr(x,y))exp(1β(ϕ(x)))yπref(y|x)exp(1βr(x,y))exp(1β(ϕ(x)))=πr(y|x),subscript𝜋superscript𝑟conditional𝑦𝑥subscript𝜋refconditional𝑦𝑥1𝛽𝑟𝑥𝑦cancel1𝛽superscriptitalic-ϕ𝑥subscript𝑦subscript𝜋refconditional𝑦𝑥1𝛽𝑟𝑥𝑦cancel1𝛽superscriptitalic-ϕ𝑥subscript𝜋𝑟conditional𝑦𝑥\pi_{r^{\prime}}(y|x)=\frac{\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,% y)\right)\cancel{\exp\left(\frac{1}{\beta}(\phi^{\prime}(x))\right)}}{\sum_{y}% \pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)\cancel{\exp\left(% \frac{1}{\beta}(\phi^{\prime}(x))\right)}}=\pi_{r}(y|x),italic_π start_POSTSUBSCRIPT italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y | italic_x ) = divide start_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG italic_r ( italic_x , italic_y ) ) cancel roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ( italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x ) ) ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG italic_r ( italic_x , italic_y ) ) cancel roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ( italic_ϕ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x ) ) ) end_ARG = italic_π start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( italic_y | italic_x ) ,

which completes the proof.

Appendix B The Limitation of Single-Objective Alignment

In the following content, we provide a theoretical analysis that the model trained by the single-objective alignment paradigm could actually misalign with every labeler. We conduct analysis on RLHF, the most common approach. We make the following assumptions:

Assumption B.1.

Human preference can be modeled by the Bradley-Terry model [10].

Assumption B.2.

Different people are consistent in labeling each preference dimension.

These two assumptions imply that people possess the same reward model ri(x,y)subscript𝑟𝑖𝑥𝑦r_{i}(x,y)italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) for each preference dimension i𝑖iitalic_i.

Assumption B.3.

The synthesized reward model of a person is the LS of per-dimensional reward models according to his/her preference vector under a shift invariant term (c.f [43][Lemma1]). That is,

r(x,y)=i=1mλiri(x,y)+ϕ(x).𝑟𝑥𝑦superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝑟𝑖𝑥𝑦italic-ϕ𝑥r(x,y)=\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)+\phi(x).italic_r ( italic_x , italic_y ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_ϕ ( italic_x ) . (9)

Now we prove the main theoretical result.

Theorem B.4.

Consider the case where there are n𝑛nitalic_n labelers in total. Each labeler hhitalic_h labels a portion phsuperscript𝑝p^{h}italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT of the entire dataset, where ph[0,1],h=1nph=1formulae-sequencesuperscript𝑝01superscriptsubscript1𝑛superscript𝑝1p^{h}\in[0,1],\sum_{h=1}^{n}p^{h}=1italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ∈ [ 0 , 1 ] , ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT = 1. The preference vector of labeler hhitalic_h is 𝛌h=(λ1h,λ2h,,λmh)superscript𝛌subscriptsuperscript𝜆1subscriptsuperscript𝜆2subscriptsuperscript𝜆𝑚{\bm{\lambda}}^{h}=(\lambda^{h}_{1},\lambda^{h}_{2},\ldots,\lambda^{h}_{m})bold_italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT = ( italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ). The labelers have different preference vectors, i.e. j,h{1,,n},𝛌j𝛌hformulae-sequence𝑗1𝑛superscript𝛌𝑗superscript𝛌\exists\ j,h\in\{1,\ldots,n\},{\bm{\lambda}}^{j}\neq{\bm{\lambda}}^{h}∃ italic_j , italic_h ∈ { 1 , … , italic_n } , bold_italic_λ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ≠ bold_italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT. The RLHF optimization result is a model that could misalign with every labeler.

Proof.

The reward model rhsuperscript𝑟r^{h}italic_r start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT of labeler hhitalic_h is rh(x,y)=i=1mλihri(x,y)+ϕh(x)superscript𝑟𝑥𝑦superscriptsubscript𝑖1𝑚subscriptsuperscript𝜆𝑖subscript𝑟𝑖𝑥𝑦superscriptitalic-ϕ𝑥r^{h}(x,y)=\sum_{i=1}^{m}\lambda^{h}_{i}r_{i}(x,y)+\phi^{h}(x)italic_r start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x , italic_y ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_ϕ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x ). Jh(θ)superscript𝐽𝜃J^{h}(\theta)italic_J start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_θ ) denotes the optimization objective corresponding to the reward model of labeler hhitalic_h. The joint optimization objective is

maxθh=1nphJh(πθ)subscript𝜃superscriptsubscript1𝑛superscript𝑝superscript𝐽subscript𝜋𝜃\displaystyle\max_{\theta}\ \sum_{h=1}^{n}p^{h}J^{h}(\pi_{\theta})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_J start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT )
(Substituting the oracle reward function.) (10)
=\displaystyle== maxθh=1nph(𝔼x𝒟[𝔼yπθ(|x)[rh(x,y)]β𝔻KL[πθ(|x)||πref(|x)]])\displaystyle\max_{\theta}\sum_{h=1}^{n}p^{h}\left(\mathbb{E}_{x\sim\mathcal{D% }}\left[\mathbb{E}_{y\sim\pi_{\theta}(\cdot|x)}\left[r^{h}(x,y)\right]-\beta% \mathbb{D}_{\text{KL}}\left[\pi_{\theta}(\cdot|x)||\pi_{\text{ref}}(\cdot|x)% \right]\right]\right)roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ italic_r start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x , italic_y ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ] )
(Rearrange reward terms.) (11)
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[h=1nphrh(x,y)]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{h=1}^{n}p^{h}r^{h}(x,y)\right]-\beta\mathbb{% D}_{\text{KL}}\left[\pi_{\theta}(\cdot|x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_r start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x , italic_y ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[h=1nph(i=1mλihri(x,y)+ϕh(x))]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{h=1}^{n}p^{h}\left(\sum_{i=1}^{m}\lambda^{h}% _{i}r_{i}(x,y)+\phi^{h}(x)\right)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_% {\theta}(\cdot|x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_ϕ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x ) ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]
(Define φ(x):=h=1nphϕh(x)assign𝜑𝑥superscriptsubscript1𝑛superscript𝑝superscriptitalic-ϕ𝑥\varphi(x)\vcentcolon=\sum_{h=1}^{n}p^{h}\phi^{h}(x)italic_φ ( italic_x ) := ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ( italic_x )) (12)
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[h=1ni=1mphλihri(x,y)+φ(x)]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{h=1}^{n}\sum_{i=1}^{m}p^{h}\lambda^{h}_{i}r_% {i}(x,y)+\varphi(x)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{\theta}(\cdot% |x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_φ ( italic_x ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[i=1mh=1nphλihri(x,y)+φ(x)]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{i=1}^{m}\sum_{h=1}^{n}p^{h}\lambda^{h}_{i}r_% {i}(x,y)+\varphi(x)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{\theta}(\cdot% |x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_φ ( italic_x ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[i=1m(h=1nphλih)ri(x,y)+φ(x)]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{i=1}^{m}\left(\sum_{h=1}^{n}p^{h}\lambda^{h}% _{i}\right)r_{i}(x,y)+\varphi(x)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{% \theta}(\cdot|x)||\pi_{\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_φ ( italic_x ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]
(Define λiopt:=h=1nphλih,i=1,,mformulae-sequenceassignsuperscriptsubscript𝜆𝑖optsuperscriptsubscript1𝑛superscript𝑝subscriptsuperscript𝜆𝑖𝑖1𝑚\lambda_{i}^{\text{opt}}\vcentcolon=\sum_{h=1}^{n}p^{h}\lambda^{h}_{i},i=1,% \ldots,mitalic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT opt end_POSTSUPERSCRIPT := ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_m) (13)
=\displaystyle== maxθ𝔼x𝒟[𝔼yπθ(|x)[i=1mλioptri(x,y)+φ(x)]β𝔻KL[πθ(|x)||πref(|x)]]\displaystyle\max_{\theta}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim% \pi_{\theta}(\cdot|x)}\left[\sum_{i=1}^{m}\lambda^{\text{opt}}_{i}r_{i}(x,y)+% \varphi(x)\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{\theta}(\cdot|x)||\pi_% {\text{ref}}(\cdot|x)\right]\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT opt end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) + italic_φ ( italic_x ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ]

Thus, we show that it actually optimizes with the preference vector 𝝀optsuperscript𝝀opt{\bm{\lambda}}^{\text{opt}}bold_italic_λ start_POSTSUPERSCRIPT opt end_POSTSUPERSCRIPT, with λiopt=h=1nphλih,i=1,,mformulae-sequencesuperscriptsubscript𝜆𝑖optsuperscriptsubscript1𝑛superscript𝑝subscriptsuperscript𝜆𝑖𝑖1𝑚\lambda_{i}^{\text{opt}}=\sum_{h=1}^{n}p^{h}\lambda^{h}_{i},i=1,\ldots,mitalic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT opt end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_p start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_m. According to the constrained RL literatures [42, 8], the corresponding optimal policy can be expressed as:

πθ(y|x)=1Z(x)πref (y|x)exp(1βi=1mλioptri(x,y)).superscriptsubscript𝜋𝜃conditional𝑦𝑥1𝑍𝑥subscript𝜋ref conditional𝑦𝑥1𝛽superscriptsubscript𝑖1𝑚superscriptsubscript𝜆𝑖optsubscript𝑟𝑖𝑥𝑦\pi_{\theta}^{*}(y|x)=\frac{1}{Z(x)}\pi_{\text{ref }}(y|x)\exp\left(\frac{1}{% \beta}\sum_{i=1}^{m}\lambda_{i}^{\text{opt}}r_{i}(x,y)\right).italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_y | italic_x ) = divide start_ARG 1 end_ARG start_ARG italic_Z ( italic_x ) end_ARG italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( italic_y | italic_x ) roman_exp ( divide start_ARG 1 end_ARG start_ARG italic_β end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT opt end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) ) . (15)

It is important to note that this optimal preference vector may not align with the individual preferences of each annotator. As a result, the trained model may not fully reflect the labeling criteria of any single annotator, potentially leading to discrepancies in the model’s predictions.

Appendix C Theoretical Support for Panacea with LS / Tche function

In the following content, we prove for Theorem 4.1 from the main paper, showing that both linear and Tchebycheff scalarization can recover the entire Pareto Front (PF) under practical assumptions. The proof has two subsections: first for the linear scalarization function in Section C.1, followed by the Tchebycheff aggregation function in Section C.2.

C.1 Proof for LS Aggregation Function

We provide a proof sketch for this part.

Step 1: Under the full categorical representation assumption, for any two policies π(a)(|x)\pi^{(a)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ) and π(b)(|x)\pi^{(b)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ), we can create a new policy (πsuperscript𝜋\pi^{\prime}italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT) that, with probability (w.p.) p𝑝pitalic_p (where 0p10𝑝10\leq p\leq 10 ≤ italic_p ≤ 1), takes π(a)(|x)\pi^{(a)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ) and w.p. 1p1𝑝1-p1 - italic_p, takes π(b)(|x)\pi^{(b)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ). This policy can also be represented by LLM.

Step 2: Using the above policy construction method, we prove that the objective spaces of DPO, RLHF, and SFT are convex.

Step 3: When the objective spaces are convex, the Pareto objectives found by LS aggregation function (Convex coverage set (CCS)) equal the entire Pareto front.

Step 4: By optimizing the Panacea objective function 𝔼𝝀Δm[g𝝀LS(θ)]subscript𝔼𝝀subscriptΔ𝑚delimited-[]subscriptsuperscript𝑔LS𝝀𝜃\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}\left[g^{\mathrm{LS}}_{\bm{\lambda}}(% \theta)\right]blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) ], we can recover the entire Pareto front.

Then, we start our formal proof. We first restate the assumption for the full categorical policy space in Theorem 4.1.

Assumption C.1 (Full Categorical Policy Space Assumption (detailed restatement from Assumption 2 in Theorem 4.1)).

For a specific preference vector 𝝀𝝀{\bm{\lambda}}bold_italic_λ, the LLM policy space formed by all yπθ,𝝀(|x)y\sim\pi_{\theta,{\bm{\lambda}}}(\cdot|x)italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( ⋅ | italic_x ) can represent all the categorical distribution set Π(x)Π𝑥\Pi(x)roman_Π ( italic_x ) for response y=[t1,,tN]𝑦subscript𝑡1subscript𝑡𝑁y=[t_{1},\ldots,t_{N}]italic_y = [ italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ], where N𝑁Nitalic_N is the response length and tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT denote each token, given an input sentence x𝑥xitalic_x.

This assumption is proper because the probability of each token t1,,tNsubscript𝑡1subscript𝑡𝑁t_{1},\ldots,t_{N}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT (N𝑁Nitalic_N denotes the length of the output of y𝑦yitalic_y) can be represented by a LLM policy. Given the strong representation ability of LLMs, any probability value of token sequence t1,,tNsubscript𝑡1subscript𝑡𝑁t_{1},\ldots,t_{N}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT can be represented by their output. With this assumption, a direct corollary holds because the linear combination of categorical distributions is still a categorical distribution.

As a corollary of C.1, we have:

Corollary C.2.

For two policies π(a)(|x)\pi^{(a)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ) and π(b)(|x)\pi^{(b)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ), a new policy πsuperscript𝜋\pi^{\prime}italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT w.p. p𝑝pitalic_p (0p10𝑝10\leq p\leq 10 ≤ italic_p ≤ 1) follows π(a)(|x)\pi^{(a)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ) and w.p. 1p1𝑝1-p1 - italic_p follows π(b)(|x)\pi^{(b)}(\cdot|x)italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ( ⋅ | italic_x ) belongs to the categorical distribution Π(x)Π𝑥\Pi(x)roman_Π ( italic_x ).

The reason for that is such constructed policy is still a categorical distribution. For the next step, we use this corollary to prove the following lemma to show that the objective spaces 𝑱SFTsubscript𝑱SFT{\bm{J}}_{\mathrm{SFT}}bold_italic_J start_POSTSUBSCRIPT roman_SFT end_POSTSUBSCRIPT, 𝑱RLHFsubscript𝑱RLHF{\bm{J}}_{\mathrm{RLHF}}bold_italic_J start_POSTSUBSCRIPT roman_RLHF end_POSTSUBSCRIPT, and 𝑱DPOsubscript𝑱DPO{\bm{J}}_{\mathrm{DPO}}bold_italic_J start_POSTSUBSCRIPT roman_DPO end_POSTSUBSCRIPT are convex.

Lemma C.3 ( Convex space Lemma, adapted from [22](Eq. 13) ).

For any two objectives 𝐉alg(a)subscriptsuperscript𝐉𝑎alg{\bm{J}}^{(a)}_{\mathrm{alg}}bold_italic_J start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT and 𝐉alg(b)subscriptsuperscript𝐉𝑏alg{\bm{J}}^{(b)}_{\mathrm{alg}}bold_italic_J start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT, and for any 0<α<10𝛼10<\alpha<10 < italic_α < 1, there exists a policy πΠ(x)superscript𝜋Π𝑥\pi^{\prime}\in\Pi(x)italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ roman_Π ( italic_x ) such that α𝐉alg(a)+(1α)𝐉alg(b)=𝐉(π)𝛼subscriptsuperscript𝐉𝑎alg1𝛼subscriptsuperscript𝐉𝑏alg𝐉superscript𝜋\alpha{\bm{J}}^{(a)}_{\mathrm{alg}}+(1-\alpha){\bm{J}}^{(b)}_{\mathrm{alg}}={% \bm{J}}(\pi^{\prime})italic_α bold_italic_J start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT + ( 1 - italic_α ) bold_italic_J start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT = bold_italic_J ( italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), where 𝐉algsubscript𝐉alg{\bm{J}}_{\mathrm{alg}}bold_italic_J start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT can be 𝐉DPOsubscript𝐉DPO{\bm{J}}_{\mathrm{DPO}}bold_italic_J start_POSTSUBSCRIPT roman_DPO end_POSTSUBSCRIPT, 𝐉SFTsubscript𝐉SFT{\bm{J}}_{\mathrm{SFT}}bold_italic_J start_POSTSUBSCRIPT roman_SFT end_POSTSUBSCRIPT, or 𝐉RLHFsubscript𝐉RLHF{\bm{J}}_{\mathrm{RLHF}}bold_italic_J start_POSTSUBSCRIPT roman_RLHF end_POSTSUBSCRIPT.

This lemma mainly follows from Eq. 13 in [22]. We include their proof for our purpose for completeness. The objectives 𝑱SFTsubscript𝑱SFT{\bm{J}}_{\mathrm{SFT}}bold_italic_J start_POSTSUBSCRIPT roman_SFT end_POSTSUBSCRIPT, 𝑱RLHFsubscript𝑱RLHF{\bm{J}}_{\mathrm{RLHF}}bold_italic_J start_POSTSUBSCRIPT roman_RLHF end_POSTSUBSCRIPT, and 𝑱DPOsubscript𝑱DPO{\bm{J}}_{\mathrm{DPO}}bold_italic_J start_POSTSUBSCRIPT roman_DPO end_POSTSUBSCRIPT can all be written as 𝑱alg(π)=𝔼x,yD[𝒇(x,y,π(y|x))]subscript𝑱alg𝜋subscript𝔼𝑥𝑦𝐷delimited-[]𝒇𝑥𝑦𝜋conditional𝑦𝑥{\bm{J}}_{\mathrm{alg}}(\pi)=\mathbb{E}_{{x,y}\in D}[{\bm{f}}(x,y,\pi(y|x))]bold_italic_J start_POSTSUBSCRIPT roman_alg end_POSTSUBSCRIPT ( italic_π ) = blackboard_E start_POSTSUBSCRIPT italic_x , italic_y ∈ italic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π ( italic_y | italic_x ) ) ] for some particular design of 𝒇(x,y,π(y|x))𝒇𝑥𝑦𝜋conditional𝑦𝑥{\bm{f}}(x,y,\pi(y|x))bold_italic_f ( italic_x , italic_y , italic_π ( italic_y | italic_x ) ). For any 0α10𝛼10\leq\alpha\leq 10 ≤ italic_α ≤ 1, by Corollary C.2, we can construct a new policy πsuperscript𝜋\pi^{\prime}italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and a uniform random variable SU(0,1)similar-to𝑆𝑈01S\sim U(0,1)italic_S ∼ italic_U ( 0 , 1 ) such that:

π(y|x)={πa(y|x)if S<απb(y|x)if Sαsuperscript𝜋conditional𝑦𝑥casessuperscript𝜋𝑎conditional𝑦𝑥if 𝑆𝛼superscript𝜋𝑏conditional𝑦𝑥if 𝑆𝛼\pi^{\prime}(y|x)=\begin{cases}\pi^{a}(y|x)&\text{if }S<\alpha\\ \pi^{b}(y|x)&\text{if }S\geq\alpha\end{cases}italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_y | italic_x ) = { start_ROW start_CELL italic_π start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ( italic_y | italic_x ) end_CELL start_CELL if italic_S < italic_α end_CELL end_ROW start_ROW start_CELL italic_π start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( italic_y | italic_x ) end_CELL start_CELL if italic_S ≥ italic_α end_CELL end_ROW

Then,

𝑱(π)𝑱superscript𝜋\displaystyle{\bm{J}}(\pi^{\prime})bold_italic_J ( italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) =𝔼(x,y)𝒟[𝒇(x,y,π(y|x))]absentsubscript𝔼similar-to𝑥𝑦𝒟delimited-[]𝒇𝑥𝑦superscript𝜋conditional𝑦𝑥\displaystyle=\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{f}}(x,y,\pi^{\prime}(y|x))]= blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) ]
=𝔼SU(0,1)𝔼(x,y)𝒟[𝒇(x,y,π(y|x))|S]absentsubscript𝔼similar-to𝑆𝑈01subscript𝔼similar-to𝑥𝑦𝒟delimited-[]conditional𝒇𝑥𝑦superscript𝜋conditional𝑦𝑥𝑆\displaystyle=\mathbb{E}_{S\sim U(0,1)}\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{% f}}(x,y,\pi^{\prime}(y|x))|S]= blackboard_E start_POSTSUBSCRIPT italic_S ∼ italic_U ( 0 , 1 ) end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) | italic_S ]
=α𝔼(x,y)𝒟[𝒇(x,y,π(y|x))|S<α]+(1α)𝔼(x,y)𝒟[𝒇(x,y,π(y|x))|Sα]absent𝛼subscript𝔼similar-to𝑥𝑦𝒟delimited-[]conditional𝒇𝑥𝑦superscript𝜋conditional𝑦𝑥𝑆𝛼1𝛼subscript𝔼similar-to𝑥𝑦𝒟delimited-[]conditional𝒇𝑥𝑦superscript𝜋conditional𝑦𝑥𝑆𝛼\displaystyle=\alpha\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{f}}(x,y,\pi^{\prime% }(y|x))|S<\alpha]+(1-\alpha)\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{f}}(x,y,\pi% ^{\prime}(y|x))|S\geq\alpha]= italic_α blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) | italic_S < italic_α ] + ( 1 - italic_α ) blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) | italic_S ≥ italic_α ]
=α𝔼(x,y)𝒟[𝒇(x,y,π(a)(y|x))]+(1α)𝔼(x,y)𝒟[𝒇(x,y,π(b)(y|x))]absent𝛼subscript𝔼similar-to𝑥𝑦𝒟delimited-[]𝒇𝑥𝑦superscript𝜋𝑎conditional𝑦𝑥1𝛼subscript𝔼similar-to𝑥𝑦𝒟delimited-[]𝒇𝑥𝑦superscript𝜋𝑏conditional𝑦𝑥\displaystyle=\alpha\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{f}}(x,y,\pi^{(a)}(y% |x))]+(1-\alpha)\mathbb{E}_{(x,y)\sim\mathcal{D}}[{\bm{f}}(x,y,\pi^{(b)}(y|x))]= italic_α blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) ] + ( 1 - italic_α ) blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D end_POSTSUBSCRIPT [ bold_italic_f ( italic_x , italic_y , italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ( italic_y | italic_x ) ) ]
=α𝑱(π(a))+(1α)𝑱(π(b))absent𝛼𝑱superscript𝜋𝑎1𝛼𝑱superscript𝜋𝑏\displaystyle=\alpha{\bm{J}}(\pi^{(a)})+(1-\alpha){\bm{J}}(\pi^{(b)})= italic_α bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ) + ( 1 - italic_α ) bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT )

Thus, for any convex combination of 𝑱(π(a))𝑱superscript𝜋𝑎{\bm{J}}(\pi^{(a)})bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ) and 𝑱(π(b))𝑱superscript𝜋𝑏{\bm{J}}(\pi^{(b)})bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ), there exists a policy πsuperscript𝜋\pi^{\prime}italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT such that 𝑱(π)=α𝑱(π(a))+(1α)𝑱(π(b))𝑱superscript𝜋𝛼𝑱superscript𝜋𝑎1𝛼𝑱superscript𝜋𝑏{\bm{J}}(\pi^{\prime})=\alpha{\bm{J}}(\pi^{(a)})+(1-\alpha){\bm{J}}(\pi^{(b)})bold_italic_J ( italic_π start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_α bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_a ) end_POSTSUPERSCRIPT ) + ( 1 - italic_α ) bold_italic_J ( italic_π start_POSTSUPERSCRIPT ( italic_b ) end_POSTSUPERSCRIPT ), indicating that the space of 𝑱(π)𝑱𝜋{\bm{J}}(\pi)bold_italic_J ( italic_π ) is convex. We denote the full space of 𝑱(π)𝑱𝜋{\bm{J}}(\pi)bold_italic_J ( italic_π ) for all policies as 𝕁𝕁{\mathbb{J}}blackboard_J.

For the third step, we use Lemma C.3 to establish that linear scalarization functions have the capability to discover the complete PF by traversing the entire preference simplex ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT (i.e., the approach employed in Panacea). To prove for that, we introduce the concept of the convex coverage set, which is the objective set that can be found by optimizing the linear scalization function with all preference vector 𝝀Δm𝝀subscriptΔ𝑚\bm{\lambda}\in\Delta_{m}bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. We now define CCS, which is the set of solutions can be found LS.

Definition C.4 (Convex Coverage Set (CCS), adapted from [45](Def. 9)).

The CCS contains the objective such that there exists a preference vector 𝝀𝝀{\bm{\lambda}}bold_italic_λ where the inner product of 𝝀𝝀{\bm{\lambda}}bold_italic_λ and this objective is greater than that of 𝝀𝝀{\bm{\lambda}}bold_italic_λ with any other objective vectors in the objective space. CCS :={𝑱𝕁|𝝀Δm:=\left\{{\bm{J}}\in{\mathbb{J}}|\exists{\bm{\lambda}}\in\Delta_{m}\right.:= { bold_italic_J ∈ blackboard_J | ∃ bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT s.t. 𝝀𝑱𝝀𝑱,𝑱𝕁}{\bm{\lambda}}^{\top}{\bm{J}}\geq{\bm{\lambda}}^{\top}{\bm{J}}^{\prime},% \forall{\bm{J}}^{\prime}\in{\mathbb{J}}\}bold_italic_λ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_J ≥ bold_italic_λ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_J start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , ∀ bold_italic_J start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ blackboard_J }.

Finally, we prove for that that when the objective space is convex, the linear scalarization can recover the whole Pareto objective set, i.e., 𝒯=CCS𝒯CCS{\mathcal{T}}=\mathrm{CCS}caligraphic_T = roman_CCS, where 𝒯𝒯{\mathcal{T}}caligraphic_T denote the objective vectors forming the Pareto front.

Proof.

The PF 𝒯𝒯{\mathcal{T}}caligraphic_T is a subset of the boundary of the objective space, denoted as (𝑱(Π))𝑱Π\partial({\bm{J}}(\Pi))∂ ( bold_italic_J ( roman_Π ) ). By proving that 𝑱(Π)𝑱Π{\bm{J}}(\Pi)bold_italic_J ( roman_Π ) is a convex set, we can apply the supporting hyperplane theorem [9] (Sec. 2.5.2). According to this theorem, for every element 𝒓𝒓{\bm{r}}bold_italic_r in (𝑱(Π))𝑱Π\partial({\bm{J}}(\Pi))∂ ( bold_italic_J ( roman_Π ) ), there exists 𝝀𝝀{\bm{\lambda}}\in\mathbb{R}bold_italic_λ ∈ blackboard_R such that 𝝀T(𝒓𝒓)0superscript𝝀𝑇𝒓superscript𝒓0{\bm{\lambda}}^{T}({\bm{r}}-{\bm{r}}^{\prime})\geq 0bold_italic_λ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_r - bold_italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≥ 0 for all 𝒓𝑱(Π)superscript𝒓𝑱Π{\bm{r}}^{\prime}\in{\bm{J}}(\Pi)bold_italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ bold_italic_J ( roman_Π ). Moreover, when 𝒓𝒓{\bm{r}}bold_italic_r is Pareto optimal, such 𝝀0succeeds-or-equals𝝀0{\bm{\lambda}}\succeq 0bold_italic_λ ⪰ 0. Hence, we have 𝝀T(𝒓𝒓)0superscript𝝀𝑇𝒓superscript𝒓0{\bm{\lambda}}^{T}({\bm{r}}-{\bm{r}}^{\prime})\geq 0bold_italic_λ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( bold_italic_r - bold_italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≥ 0 for all 𝒓𝑱(Π)superscript𝒓𝑱Π{\bm{r}}^{\prime}\in{\bm{J}}(\Pi)bold_italic_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ bold_italic_J ( roman_Π ) and 𝝀Δm𝝀subscriptΔ𝑚{\bm{\lambda}}\in\Delta_{m}bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. This condition implies that 𝒯CCS𝒯CCS{\mathcal{T}}\subset\mathrm{CCS}caligraphic_T ⊂ roman_CCS. Since it has been established that CCS𝒯CCS𝒯\mathrm{CCS}\subset{\mathcal{T}}roman_CCS ⊂ caligraphic_T, we can conclude that CCS=𝒯CCS𝒯\mathrm{CCS}={\mathcal{T}}roman_CCS = caligraphic_T. ∎

For the last step, we demonstrate that by optimizing 𝔼𝝀Δm[g𝝀LS(θ)]subscript𝔼𝝀subscriptΔ𝑚delimited-[]subscriptsuperscript𝑔LS𝝀𝜃\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}\left[g^{\mathrm{LS}}_{\bm{\lambda}}(% \theta)\right]blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) ] using the LS aggregation function, we can recover almost the entire Pareto front. This is because, if a larger non-zero measure Pareto front could not be found, it implies that there exist non-zero measure preference vectors that would make the expectation function value 𝔼𝝀Δm[g𝝀LS(θ)]subscript𝔼𝝀subscriptΔ𝑚delimited-[]subscriptsuperscript𝑔LS𝝀𝜃\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}\left[g^{\mathrm{LS}}_{\bm{\lambda}}(% \theta)\right]blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) ] exceed its optimal value, which is contradictory of our assumption.

C.2 Proof for Tchebycheff Aggregation Function

To prove that using the Tchebycheff aggregation function allows Panacea to recover the full Pareto front, we introduce the following lemma:

Lemma C.5 (Adapted from [13], Theorem 3.1).

A feasible solution θ𝜃\thetaitalic_θ is Pareto optimal if and only if there exists a weight vector λ𝜆\lambdaitalic_λ such that θ𝜃\thetaitalic_θ is an optimal solution to the aggregation function (Equation 7) defined in the main paper.

Using this lemma and assuming Panacea can represent the Pareto policy under all preferences (Assumption 1 in Theorem 4.1), optimizing the expectation loss

𝔼𝝀Δmg𝝀Tche(θ)subscript𝔼𝝀subscriptΔ𝑚subscriptsuperscript𝑔Tche𝝀𝜃-\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)- blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ )

allows Panacea to recover almost every policy.

Proof.

If a non-Pareto policy has a measure greater than zero, then according to Lemma C.5, there exists a preference set of greater than zero measure where the non-Pareto policy has a higher value compared to the optimal value of the Tchebycheff function under the corresponding preferences. This implies that 𝔼𝝀Δmg𝝀Tche(θ)subscript𝔼𝝀subscriptΔ𝑚subscriptsuperscript𝑔Tche𝝀𝜃\mathbb{E}_{{\bm{\lambda}}\in\Delta_{m}}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)blackboard_E start_POSTSUBSCRIPT bold_italic_λ ∈ roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) has not been optimized to its optimal value, contradicting Assumption 1 in Theorem 4.1. ∎

Appendix D Aggregated Training Objectives for Panacea

In this section, we present the LS / Tche aggregated training objectives for Panacea with RLHF / DPO / SFT. In RLHF, reward models ri,i=1,,mformulae-sequencesubscript𝑟𝑖𝑖1𝑚r_{i},i=1,\ldots,mitalic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_m are learned for each preference dimension. For a specific preference vector, the LS aggregated objective function is

maxθg𝝀LS(θ)=maxθ𝔼x𝒟[𝔼yπθ,𝝀(|x)[i=1mλiri(x,y)]β𝔻KL[πθ,𝝀(|x)||πref(|x)]].\displaystyle\max_{\theta}g^{\mathrm{LS}}_{\bm{\lambda}}(\theta)=\max_{\theta}% \ \mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi_{\theta,{\bm{\lambda% }}}(\cdot|x)}\left[\sum_{i=1}^{m}\lambda_{i}r_{i}(x,y)\right]-\beta\mathbb{D}_% {\text{KL}}\left[\pi_{\theta,{\bm{\lambda}}}(\cdot|x)||\pi_{\text{ref}}(\cdot|% x)\right]\right].roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ] . (16)

The Tche aggregated objective is

maxθg𝝀Tche(θ)=maxθ𝔼x𝒟[𝔼yπθ,𝝀(|x)[max1imλi(ziri(x,y))]β𝔻KL[πθ,𝝀(|x)||πref(|x)]],\displaystyle\max_{\theta}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)=\max_{% \theta}\ \mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi_{\theta,{\bm{% \lambda}}}(\cdot|x)}\left[-\max_{1\leq i\leq m}\lambda_{i}(z_{i}-r_{i}(x,y))% \right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{\theta,{\bm{\lambda}}}(\cdot|x)|% |\pi_{\text{ref}}(\cdot|x)\right]\right],roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_x ∼ caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_y ∼ italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( ⋅ | italic_x ) end_POSTSUBSCRIPT [ - roman_max start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_x , italic_y ) ) ] - italic_β blackboard_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT [ italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( ⋅ | italic_x ) | | italic_π start_POSTSUBSCRIPT ref end_POSTSUBSCRIPT ( ⋅ | italic_x ) ] ] , (17)

where zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the maximum reward for preference dimension i𝑖iitalic_i. Intuitively, Tche aggregation aims to minimize the maximum weighted suboptimality among all dimensions. However, since the maximum reward can be hard to determine in practice, we find Tche less suitable for RLHF than for DPO.

DPO transforms the reinforcement learning objective into a supervised objective, whose LS aggregated objective is

maxθg𝝀LS(θ)=subscript𝜃subscriptsuperscript𝑔LS𝝀𝜃absent\displaystyle\max_{\theta}g^{\mathrm{LS}}_{\bm{\lambda}}(\theta)=roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = maxθi=1mλiJDPO,i(πθ,𝝀)subscript𝜃superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝐽DPO𝑖subscript𝜋𝜃𝝀\displaystyle\max_{\theta}\sum_{i=1}^{m}\lambda_{i}J_{\text{DPO},i}(\pi_{% \theta,{\bm{\lambda}}})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT DPO , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT )
=\displaystyle== maxθi=1mλi𝔼(x,yw,yl)𝒟i[logσ(βlogπθ,𝝀(yw|x)πref(yw|x)βlogπθ,𝝀(yl|x)πref(yl|x))].subscript𝜃superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝔼similar-to𝑥subscript𝑦𝑤subscript𝑦𝑙subscript𝒟𝑖delimited-[]𝜎𝛽subscript𝜋𝜃𝝀conditionalsubscript𝑦𝑤𝑥subscript𝜋refconditionalsubscript𝑦𝑤𝑥𝛽subscript𝜋𝜃𝝀conditionalsubscript𝑦𝑙𝑥subscript𝜋refconditionalsubscript𝑦𝑙𝑥\displaystyle\max_{\theta}\sum_{i=1}^{m}\lambda_{i}\mathbb{E}_{(x,y_{w},y_{l})% \sim\mathcal{D}_{i}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta,{\bm{% \lambda}}}\left(y_{w}|x\right)}{\pi_{\mathrm{ref}}\left(y_{w}|x\right)}-\beta% \log\frac{\pi_{\theta,{\bm{\lambda}}}\left(y_{l}|x\right)}{\pi_{\mathrm{ref}}% \left(y_{l}|x\right)}\right)\right].roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_σ ( italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG ) ] . (18)

To derive the Tche aggregated objective, we have

maxθg𝝀Tche(θ)=subscript𝜃subscriptsuperscript𝑔Tche𝝀𝜃absent\displaystyle\max_{\theta}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)=roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = maxθmin1imλi(JDPO,i(πθ,𝝀)zi)subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝐽DPO𝑖subscript𝜋𝜃𝝀subscript𝑧𝑖\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}(J_{\text{DPO},i}(\pi% _{\theta,{\bm{\lambda}}})-z_{i})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_J start_POSTSUBSCRIPT DPO , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ) - italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=\displaystyle== maxθmin1imλiJDPO,i(πθ,𝝀)subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝐽DPO𝑖subscript𝜋𝜃𝝀\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}J_{\text{DPO},i}(\pi_% {\theta,{\bm{\lambda}}})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT DPO , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT )
=\displaystyle== maxθmin1imλi𝔼(x,yw,yl)𝒟i[logσ(βlogπθ,𝝀(yw|x)πref(yw|x)βlogπθ,𝝀(yl|x)πref(yl|x))]subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝔼similar-to𝑥subscript𝑦𝑤subscript𝑦𝑙subscript𝒟𝑖delimited-[]𝜎𝛽subscript𝜋𝜃𝝀conditionalsubscript𝑦𝑤𝑥subscript𝜋refconditionalsubscript𝑦𝑤𝑥𝛽subscript𝜋𝜃𝝀conditionalsubscript𝑦𝑙𝑥subscript𝜋refconditionalsubscript𝑦𝑙𝑥\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}\mathbb{E}_{(x,y_{w},% y_{l})\sim\mathcal{D}_{i}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta,{% \bm{\lambda}}}\left(y_{w}|x\right)}{\pi_{\mathrm{ref}}\left(y_{w}|x\right)}-% \beta\log\frac{\pi_{\theta,{\bm{\lambda}}}\left(y_{l}|x\right)}{\pi_{\mathrm{% ref}}\left(y_{l}|x\right)}\right)\right]roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_σ ( italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG - italic_β roman_log divide start_ARG italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG start_ARG italic_π start_POSTSUBSCRIPT roman_ref end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG ) ] (19)

Since the optimal value zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for per-dimension DPO objective is 00, this is naturally compatible with Tche aggregation.

Finally, the LS aggregated SFT objective is

maxθg𝝀LS(θ)=maxθi=1mλiJSFT,i(πθ,𝝀)=maxθi=1mλi𝔼(x,y)𝒟i[logπθ,𝝀(y|x)].subscript𝜃subscriptsuperscript𝑔LS𝝀𝜃subscript𝜃superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝐽SFT𝑖subscript𝜋𝜃𝝀subscript𝜃superscriptsubscript𝑖1𝑚subscript𝜆𝑖subscript𝔼similar-to𝑥𝑦subscript𝒟𝑖delimited-[]subscript𝜋𝜃𝝀conditional𝑦𝑥\displaystyle\max_{\theta}g^{\mathrm{LS}}_{\bm{\lambda}}(\theta)=\max_{\theta}% \sum_{i=1}^{m}\lambda_{i}J_{\text{SFT},i}(\pi_{\theta,{\bm{\lambda}}})=\max_{% \theta}\sum_{i=1}^{m}\lambda_{i}\mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}\left[% \log\pi_{\theta,{\bm{\lambda}}}(y|x)\right].roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_LS end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT SFT , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ) = roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y | italic_x ) ] . (20)

Similar to DPO, since the optimal value zisubscript𝑧𝑖z_{i}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for per-dimension SFT objective is 0, the Tche aggregation of SFT objectives is

maxθg𝝀Tche(θ)=subscript𝜃subscriptsuperscript𝑔Tche𝝀𝜃absent\displaystyle\max_{\theta}g^{\mathrm{Tche}}_{\bm{\lambda}}(\theta)=roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_g start_POSTSUPERSCRIPT roman_Tche end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_italic_λ end_POSTSUBSCRIPT ( italic_θ ) = maxθmin1imλi(JSFT,i(πθ,𝝀)zi)subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝐽SFT𝑖subscript𝜋𝜃𝝀subscript𝑧𝑖\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}(J_{\text{SFT},i}(\pi% _{\theta,{\bm{\lambda}}})-z_{i})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_J start_POSTSUBSCRIPT SFT , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ) - italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=\displaystyle== maxθmin1imλiJSFT,i(πθ,𝝀)subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝐽SFT𝑖subscript𝜋𝜃𝝀\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}J_{\text{SFT},i}(\pi_% {\theta,{\bm{\lambda}}})roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_J start_POSTSUBSCRIPT SFT , italic_i end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT )
=\displaystyle== maxθmin1imλi𝔼(x,y)𝒟i[logπθ,𝝀(y|x)].subscript𝜃subscript1𝑖𝑚subscript𝜆𝑖subscript𝔼similar-to𝑥𝑦subscript𝒟𝑖delimited-[]subscript𝜋𝜃𝝀conditional𝑦𝑥\displaystyle\max_{\theta}\min_{1\leq i\leq m}\lambda_{i}\mathbb{E}_{(x,y)\sim% \mathcal{D}_{i}}\left[\log\pi_{\theta,{\bm{\lambda}}}(y|x)\right].roman_max start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_min start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_m end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y ) ∼ caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ roman_log italic_π start_POSTSUBSCRIPT italic_θ , bold_italic_λ end_POSTSUBSCRIPT ( italic_y | italic_x ) ] . (21)

Appendix E Experiment Details and Additional Results

In this section, we present experimental details including computational resources, algorithm implementation, data curation, experiment setup, and evaluation details, and analyze additional results. All our experiments are conducted on an 8×\times×A800-80GB GPU server. Other details are elaborated below.

E.1 Core Implementation of Panacea

Refer to caption
Figure 8: Core implementation of Panacea.

Our implementation is based on the Safe-RLHF [14] codebase. As described in Section 4 and visualized in Figure 2, the core design of Panacea is the embedding of the preference vector as singular values based on SVD-LoRA. Its core code is presented in Figure 8. In our experiments, we perform Panacea adaptation to all self-attention and MLP layers. We initialize the singular values and preference scaling to zero, so as not to impact the model behavior at the beginning of training [21, 56]. In each iteration, we sample a preference vector from the preference simplex, embed it into the model, and train the model on the aggregated objective.

E.2 Data Curation

In the helpful-harmless (HH) problem in Section 5.1, we use the BeaverTails dataset [27], which contains both helpfulness and harmlessness preference labels. In the augmented helpful-harmless-concise (HHC) problem in Section 5.2, we again use the BeaverTails dataset. For RLHF, we define the reward model as a rectified affine function,

rconcise(x,y)={rmax,lycrmax+1lyc,otherwisesubscript𝑟concise𝑥𝑦casessubscript𝑟maxsubscript𝑙𝑦𝑐otherwisesubscript𝑟max1subscript𝑙𝑦𝑐otherwiseotherwiser_{\text{concise}}(x,y)=\begin{cases}r_{\text{max}},\ l_{y}\leq c\\ r_{\text{max}}+1-\frac{l_{y}}{c},\ \text{otherwise}\\ \end{cases}italic_r start_POSTSUBSCRIPT concise end_POSTSUBSCRIPT ( italic_x , italic_y ) = { start_ROW start_CELL italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ≤ italic_c end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT + 1 - divide start_ARG italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT end_ARG start_ARG italic_c end_ARG , otherwise end_CELL start_CELL end_CELL end_ROW

where rmaxsubscript𝑟maxr_{\text{max}}italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT defines the maximum reward, lysubscript𝑙𝑦l_{y}italic_l start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT denotes token length of response y𝑦yitalic_y, and c𝑐citalic_c defines both the threshold for maximum reward and the slope of concise reward model. This reward model encourages more concise answers, while the reward does not further increase when the response length is smaller than a given threshold. For DPO, we label the shorter response to each prompt as preferred.

In the Chat multi-dimensional alignment problem in Section 5.3, we curate SFT data by letting Llama-3-8B-Instruct [2] generate responses for Alpaca prompts [47] in each dimension. Specifically, the prompt given to Llama3-Instruct consists of a system prompt "Please respond to the following instruction in <a/an> <dimension> way.", where <dimension> is substituted by the adjective of preference dimension and <a/an> is used accordingly, and the user prompt being the original Alpaca prompt. We employ vLLM [30] for fast model inference to accelerate data generation.

E.3 Experiment Setup

Table 2: Common hyperparams of Panacea with RLHF.
\hlineB3 Hyperparams Values Hyperparams Values
\hlineB2 max_length 512 critic_weight_decay 0.0
kl_coeff 0.02 critic_lr_scheduler_type “constant"
clip_range_ratio 0.2 critic_lr_warmup_ratio 0.03
clip_range_score 50.0 critic_gradient_checkpointing true
clip_range_value 5.0 normalize_reward false
epochs 2 seed 42
update_iters 1 fp16 false
gradient_accumulation_steps 2 bf16 true
actor_lr 0.002 tf32 true
actor_weight_decay 0.01 lora_dim 8
actor_lr_scheduler_type “cosine" lora_scaling 512
actor_lr_warmup_ratio 0.03 only_optimize_lora true
actor_gradient_checkpointing true lora_module_name “layers."
critic_lr 0.001 num_return_sequences 1
repetition_penalty 1.0 temperature 1.0
top_p 1.0
\hlineB3

In this part, we present details about the experiment setup. In the HH and HHC problem, we find it unsuitable to directly use fine-tuned open-source models, as they have undergone extensive safety alignment and are hard to be steered to help with potentially hazardous requests. Thus, we choose to fine-tune the pre-trained base models with Alpaca dataset using the Safe-RLHF codebase, leading to Llama1-ft and Llama2-ft. The reward models are trained upon these SFT models. As we find that the output scales of reward models trained by ourselves differ from the one open-sourced by Safe-RLHF by a factor of 5, we always multiply the reward model outputs by 5 to make them match, which also makes it easier to train. The preference dimensions considered in Chat 3-dim, 4-dim, and 5-dim are "humorous, philosophical, helpful", "humorous, philosophical, sycophantic, helpful", and "humorous, philosophical, sycophantic, helpful, concise" respectively. As for the rank of Panacea, we always fix k𝑘kitalic_k to 8, and m𝑚mitalic_m equals the number of preference dimensions. As the baselines learn one model for only one preference vector in one experiment, we let its rank be k+1𝑘1k+1italic_k + 1 for fair comparison. When sampling from the preference simplex, we sample the vertices, i.e. (0,1),(1,0)0110(0,1),(1,0)( 0 , 1 ) , ( 1 , 0 ), with higher probability, so as to force the singular vectors to optimize their objectives. In Table 2, Table 3, and Table 4 we provide the common hyperparameters for Panacea with RLHF, DPO, and SFT. Different hyperparameters include: in HH with RLHF and Llama1-ft, batch_size =16absent16=16= 16, ptx_coeff =16absent16=16= 16; in HH and HHC with RLHF and Llama2-ft, batch_size =8absent8=8= 8, ptx_coeff =4absent4=4= 4; in HH with DPO and Llama1-ft, learning_rate =0.0002absent0.0002=0.0002= 0.0002; in HH and HHC with DPO and Llama2-ft, learning_rate =0.001absent0.001=0.001= 0.001; in Chat 3, 4, 5-dim with SFT and Llama3-Instruct, batch_size =16absent16=16= 16; in Chat 10-dim with SFT and Llama3-Instruct, batch_size =8absent8=8= 8. We also note that in HHC with RLHF experiment, the concise reward model is defined with max_concise_reward =4absent4=4= 4 and concise_scale=50absent50=50= 50. RS is trained with the same hyperparameters.

Table 3: Common hyperparams of Panacea with DPO.
\hlineB3 Hyperparams Values Hyperparams Values Hyperparams Values
\hlineB2 max_length 512 lora_dim 8 epochs 1
scale_coeff 0.1 lora_scaling 512 seed 42
weight_decay 0.05 only_optimize_lora true fp16 false
batch_size 16 lora_module_name “layers." bf16 true
gradient_checkpointing true lr_warmup_ratio 0.03 tf32 true
gradient_steps 1 lr_scheduler_type “cosine"
\hlineB3
Table 4: Common hyperparams of Panacea with SFT.
\hlineB3 Hyperparams Values Hyperparams Values Hyperparams Values
\hlineB2 max_length 512 lora_dim 8 epochs 4
weight_decay 0.0 lora_scaling 512 seed 42
learning_rate 0.0002 only_optimize_lora true fp16 false
gradient_checkpointing true lora_module_name “layers." bf16 true
gradient_steps 2 lr_warmup_ratio 0.03 tf32 true
lr_scheduler_type “cosine"
\hlineB3
Refer to caption
Figure 9: Hypervolume illustration.

E.4 Evaluation Details

In evaluation, we evenly sample preference vectors from the preference simplex ΔmsubscriptΔ𝑚\Delta_{m}roman_Δ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT to comprehensively reflect the quality of the learned fronts. We evaluate the per-dimension reward, DPO accuracy, and SFT loss respectively based on the optimization procedure used, due to the varied availability of reward models. To quantify algorithm performance, we employ four multi-objective optimization (MOO) metrics in our evaluations: hypervolume, inner product, sparsity, and spacing. Let 𝝃={ξ1,ξ2,,ξm}𝝃subscript𝜉1subscript𝜉2subscript𝜉𝑚\bm{\xi}=\{\xi_{1},\xi_{2},\ldots,\xi_{m}\}bold_italic_ξ = { italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ξ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } represents the evaluation results of the learned model with a preference vector. Let 𝚵𝚵\bm{\Xi}bold_Ξ be the set of evaluated solutions. These metrics are defined as follows.

  1. 1.

    Hypervolume (HV):

    HV=Vol({𝝃|𝝃𝚵,𝒛𝝃𝝃}).HVVolconditional-set𝝃formulae-sequencesuperscript𝝃𝚵precedes-or-equals𝒛𝝃precedes-or-equalssuperscript𝝃\mathrm{HV}=\mathrm{Vol}(\{\bm{\xi}|\exists\ \bm{\xi}^{\prime}\in\bm{\Xi},\bm{% z}\preceq\bm{\xi}\preceq\bm{\xi}^{\prime}\}).roman_HV = roman_Vol ( { bold_italic_ξ | ∃ bold_italic_ξ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ bold_Ξ , bold_italic_z ⪯ bold_italic_ξ ⪯ bold_italic_ξ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT } ) .

    This set includes any evaluation vector that dominates a reference point 𝒛𝒛\bm{z}bold_italic_z and is dominated by at least one objective in 𝚵𝚵\bm{\Xi}bold_Ξ. 𝒛𝒛\bm{z}bold_italic_z is a fixed reference point dominated by all solutions in 𝚵𝚵\bm{\Xi}bold_Ξ. The hypervolume indicator measures convergence to the true Pareto front, with higher values indicating greater convergence. A visual illustration is provided in Figure 9.

  2. 2.

    Inner Product:

    InnerProduct=𝝀,𝝃.InnerProduct𝝀𝝃\mathrm{Inner\ Product}=\langle{\bm{\lambda}},\bm{\xi}\rangle.roman_Inner roman_Product = ⟨ bold_italic_λ , bold_italic_ξ ⟩ .

    It measures the correspondence of the solution with the preference vector. This is because the evaluation result ξisubscript𝜉𝑖\xi_{i}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is expected to be large when λisubscript𝜆𝑖\lambda_{i}italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is relatively large.

  3. 3.

    Sparsity (SP):

    SP=1m(N1)i=1N1𝝃~i𝝃~i+12.SP1𝑚𝑁1superscriptsubscript𝑖1𝑁1superscriptnormsuperscript~𝝃𝑖superscript~𝝃𝑖12\mathrm{SP}=\frac{1}{m(N-1)}\sum_{i=1}^{N-1}\|\tilde{\bm{\xi}}^{i}-\tilde{\bm{% \xi}}^{i+1}\|^{2}.roman_SP = divide start_ARG 1 end_ARG start_ARG italic_m ( italic_N - 1 ) end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT ∥ over~ start_ARG bold_italic_ξ end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT - over~ start_ARG bold_italic_ξ end_ARG start_POSTSUPERSCRIPT italic_i + 1 end_POSTSUPERSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

    This metric measures the mean squared distances between evaluation results 𝝃~isuperscript~𝝃𝑖\tilde{\bm{\xi}}^{i}over~ start_ARG bold_italic_ξ end_ARG start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT sorted in a non-dominated sort order [15]. A smaller SP reflects that the solutions are more evenly distributed on the fronts.

  4. 4.

    Spacing:

    Spacing=1Ni=1N(diμ)2,μ=1Ni=1Ndi,di=minj[N],jiρ(𝝃i,𝝃j),formulae-sequenceSpacing1𝑁superscriptsubscript𝑖1𝑁superscriptsuperscript𝑑𝑖𝜇2formulae-sequence𝜇1𝑁superscriptsubscript𝑖1𝑁superscript𝑑𝑖superscript𝑑𝑖subscriptformulae-sequence𝑗delimited-[]𝑁𝑗𝑖𝜌superscript𝝃𝑖superscript𝝃𝑗\mathrm{Spacing}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d^{i}-\mu\right)^{2}},% \quad\mu=\frac{1}{N}\sum_{i=1}^{N}d^{i},\quad d^{i}=\min_{j\in[N],j\neq i}\rho% (\bm{\xi}^{i},\bm{\xi}^{j}),roman_Spacing = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_d start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT - italic_μ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , italic_μ = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_d start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_d start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT = roman_min start_POSTSUBSCRIPT italic_j ∈ [ italic_N ] , italic_j ≠ italic_i end_POSTSUBSCRIPT italic_ρ ( bold_italic_ξ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_italic_ξ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ,

    where ρ𝜌\rhoitalic_ρ denotes Euclidean distance. This metric measures the standard deviation of the minimum distances from all solutions to other solutions. It also reflects the uniformity of the set of solutions.

Refer to caption
Figure 10: Comparison of reward distribution on eval dataset between the initial SFT model, i.e. before alignment, and Panacea with various preference vectors. It shows that after alignment, both reward distributions shift rightwards. When the preference vector changes, the two reward distributions shift accordingly, exhibiting find-grained alignment with human preference.

E.5 Additional Results

In this part, we provide some additional experimental results. In Figure 10, we compare reward distributions of the initial SFT model and Panacea for HH problem with Llama1-ft and RLHF, corresponding to Figure 3 (left). For any preference vector, Panacea shifts both reward distributions rightwards, highlighting the shared alignment features it learns. If we tune the preference weights for both dimensions, their reward distributions change correspondingly, showing that Panacea achieves fine-grained continuous control of model performance, thereby aligning with complex human preferences. Figure 14 shows the response of the model after preference shift, and more chat examples are provided in Appendix F. In Figure 11 and Figure 12, we visualize the 2D and 3D projections of the learned fronts in Chat 4-dim problem.

Refer to caption
Figure 11: Comparison of learned fronts on Chat 4-dim problem. We show 2D projections by setting two of preference weights to zero. They show that Panacea learns a superior front.
Refer to caption
Figure 12: Comparison of learned fronts on Chat 4-dim problem. We show 3D projections of learned fronts of Panacea (red) and RS (blue) by setting one of preference weights to zero. The dominance of Panacea is clear.

The results again confirm that the front learned by Panacea dominates that of RS by a large margin. Finally, we test the robustness of the preference adaptation strategy of Panacea and compare it with RS. Since the preference simplex is a low-dimensional space in msuperscript𝑚\mathbb{R}^{m}blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, we aim to see whether embedding preference vectors outside the simplex has a significant impact on the model performance. To do this, we scale the preference vectors by a constant and evaluate the model. Since RS first linearly interpolate the left, diagonal, and right matrices and then fuse them for inference,

Refer to caption
Figure 13: Robustness analysis of the preference adaptation strategy. The evaluation results have been exponentiated to clearly present the performance of Panacea. Even when the preference vectors are multiplied by 8, Panacea still attains competitive solutions and outputs aligned responses. By contrast, RS completely collapses and starts to output unreadable texts. This experiment supports the superior robustness of Panacea.

the resulting full incremental matrix is actually scaled by the cube of the constant. Thus for fair comparison, RS uses a constant of 2, and Panacea uses 8. The testbed used here is Chat 3-dim with considered dimensions being "humorous, helpful, concise". The results plotted in Figure 13 clearly demonstrates the superior robustness of Panacea. In addition, when we inspect the output responses, we find that Panacea is still generating aligned responses with the corresponding preference vector, while RS outputs become completely unreadable. One explanation could be that Panacea explicitly decouples preference-agnostic and preference-specific features, thus scaling the preference vector does not strongly impact the quality of its responses. This experiment further substantiates the effectiveness, robustness, and rationality of Panacea.

E.6 Information of assets

We present the information of assets as below:

  1. 1.

    Code

  2. 2.

    Data

  3. 3.

    Models

Refer to caption
Figure 14: This chat case from the helpful-harmless (HH) problem shows responses of Panacea to the same user prompt with 5 different preference vectors that are constantly shifting. Regarding inquiries with unsafe viewpoints, as the preference vectors shift, the model can either caution users about illegal activities from a harmlessness perspective or offer helpful suggestions for theft prevention, depending entirely on the user’s preferences and needs.

Appendix F Chat History Examples

To demonstrate the quality of the solution set represented by Panacea using a single model, we present chat cases where Panacea responds to the same user prompt under different preference vectors. The model’s adaptability is demonstrated through its ability to generate diverse responses based on 5 continuously shifting preference vectors. Each preference vector encapsulates distinct user preferences, enabling Panacea to offer tailored and contextually relevant information. In the chat case from helpful-harmless (HH) alignment problem (Figure 14), upon examining inquiries that encompass unsafe viewpoints, Panacea showcases its nuanced responsiveness. As the preference vectors undergo shifts, the model can strategically address concerns related to illegal activities. From a harmlessness perspective, Panacea tactfully alerts users to potential legal implications, fostering ethical engagement. Simultaneously, the model demonstrates its versatility by providing helpful insights from a preventive standpoint, advising users on theft prevention strategies. More examples are presented in Figure 15 and Figure 16, which are chat cases from the helpful-harmless-concise (HHC) and Chat 3-dim ("humorous, philosophical, helpful") problem. For each preference vector, Panacea outputs a response that is not only consistent with the vector but also Pareto optimal in the sense that it cannot be made better off in one dimension without negatively affecting the other dimensions. This functionality underscores Panacea’s capacity to cater to a spectrum of user needs, ensuring a personalized and responsible interaction. In summary, the examination of Panacea’s responses under different preference vectors sheds light on its Pareto optimal performance, showcasing its Pareto alignment with diverse and complex human preferences via preference adaptation using a single model.

Refer to caption
Figure 15: This chat case from the helpful-harmless-concise (HHC) problem shows responses of Panacea to the same user prompt with 5 different preference vectors. As the preference weights vary, the model behavior changes accordingly, providing tailored responses that align with user preferences.
Refer to caption
Figure 16: This chat case from the Chat 3-dim ("humorous, philosophical, helpful") problem shows how Panacea flexibly adapts to user-specified preference vectors. The preference weights continuously controls the model behavior.

Appendix G Discussions

G.1 Limitations

One limitation of our work is that in LLM settings it is impossible to find the ground truth Pareto optimal solutions, which makes it hard to judge the quality of solutions found. We tackle this limitation by comparing with DPS in Section 5.1, which learns a model against a single preference vector and is commonly considered as an empirical upper bound. Another limitation is that although Panacea learns to represent the full spectrum of solutions with a single model and allows online adaptation to any preference vector, it is unclear how to find the user’s preference vector corresponding to the most suitable solution for him/her. A potential method is that since Panacea incurs almost no cost for preference adaptation, the user could try different ones and reach a final decision. Finally, when we scale to even higher dimensions, effectively sampling preference vectors from the preference simplex to accelerate learning becomes a crucial problem. This is not addressed in this paper and could be a promising future work. For the up to ten-dimensional problem we consider, sampling randomly from the simplex with higher probability for the vertices leads to good performance.

G.2 Broader Impacts

By achieving Pareto alignment with diverse human preferences, Panacea holds the potential to alleviate biases against underrepresented groups and avoid marginalization, fostering a harmonious community where all individuals prosper. Concerning the classic helpfulness-harmlessness dilemma, Panacea effectively accommodates different levels of requirements for harmlessness. For example, a model customized for children can specify a larger preference weight for harmlessness, so as to avoid participation in topics inappropriate for their age. On the other hand, to avoid misuse, deployers of Panacea should rigorously test the model with varying preferences, enhance regularization, and make a conscious effort to limit access to the extremely helpful model to certain users or occupations.

NeurIPS Paper Checklist

  1. 1.

    Claims

  2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

  3. Answer: [Yes]

  4. Justification: In the abstract and introduction we have carefully phrased our contributions and scope. A summarization is provided in the last paragraph of the introduction.

  5. Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  6. 2.

    Limitations

  7. Question: Does the paper discuss the limitations of the work performed by the authors?

  8. Answer: [Yes]

  9. Justification: The limitations are discussed in Section G.1.

  10. Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in develo** norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  11. 3.

    Theory Assumptions and Proofs

  12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

  13. Answer: [Yes]

  14. Justification: We have clearly presented the assumptions and proofs for our theoretical results in Appendices A, B, C and D.

  15. Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  16. 4.

    Experimental Result Reproducibility

  17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  18. Answer: [Yes]

  19. Justification: We have described our method in detail in Section 4 and provided full experimental details in Appendix E.

  20. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      1. (a)

        If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      2. (b)

        If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      3. (c)

        If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      4. (d)

        We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  21. 5.

    Open access to data and code

  22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  23. Answer: [Yes]

  24. Justification: As our method is developed based on the open-source Safe-RLHF codebase [14], we describe the core implementation in Section E.1 and present full experimental details in Appendix E. These should be sufficient to reproduce our results.

  25. Guidelines:

    • The answer NA means that paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  26. 6.

    Experimental Setting/Details

  27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  28. Answer: [Yes]

  29. Justification: We have specified all the training and test details necessary to understand the results in Sections 5 and E.

  30. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  31. 7.

    Experiment Statistical Significance

  32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  33. Answer: [Yes]

  34. Justification: In Figure 3 (middle) we run one of our experiments across three seeds and observe consistent results, supporting the statistical significance of the experiments. Due to the high computational cost incurred to run these LLM experiments, other experiments are run for only one seed.

  35. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  36. 8.

    Experiments Compute Resources

  37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  38. Answer: [Yes]

  39. Justification: In Appendix E we state that all our experiments are run on an 8×\times×A800-80GB GPU server and we present our training epochs.

  40. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  41. 9.

    Code Of Ethics

  42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

  43. Answer: [Yes]

  44. Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

  45. Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  46. 10.

    Broader Impacts

  47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  48. Answer: [Yes]

  49. Justification: The broader impacts of our work are discussed in Section G.2.

  50. Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  51. 11.

    Safeguards

  52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  53. Answer: [N/A]

  54. Justification: Our paper does not release any data or models.

  55. Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  56. 12.

    Licenses for existing assets

  57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  58. Answer: [Yes]

  59. Justification: We list the citations, licenses, and the URLs of all our used assets in Section E.6.

  60. Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  61. 13.

    New Assets

  62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  63. Answer: [N/A]

  64. Justification: Our paper does not release new assets.

  65. Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  66. 14.

    Crowdsourcing and Research with Human Subjects

  67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  68. Answer: [N/A]

  69. Justification: Our paper does not involve crowdsourcing nor research with human subjects.

  70. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  71. 15.

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  73. Answer: [N/A]

  74. Justification: Our paper does not involve crowdsourcing nor research with human subjects.

  75. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.