Triple Preference Optimization:
Achieving Better Alignment with Less Data in a Single Step Optimization

Amir Saeidi  Shivanshu Verma  Aswin RRV  Chitta Baral
Arizona State University
{ssaeidi1, sverma76, aravik13, cbaral}@asu.edu
Abstract

Large Language Models (LLMs) perform well across diverse tasks, but aligning them with human demonstrations is challenging. Recently, Reinforcement Learning (RL)-free methods like Direct Preference Optimization (DPO) have emerged, offering improved stability and scalability while retaining competitive performance relative to RL-based methods. However, while RL-free methods deliver satisfactory performance, they require significant data to develop a robust Supervised Fine-Tuned (SFT) model and an additional step to fine-tune this model on a preference dataset, which constrains their utility and scalability. In this paper, we introduce Triple Preference Optimization (TPO), a new preference learning method designed to align an LLM with three preferences without requiring a separate SFT step and using considerably less data. Through a combination of practical experiments and theoretical analysis, we show the efficacy of TPO as a single-step alignment strategy. Specifically, we fine-tuned the Phi-2 (2.7B) and Mistral (7B) models using TPO directly on the UltraFeedback dataset, achieving superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO. Moreover, the performance of TPO without the SFT component led to notable improvements in the MT-Bench score, with increases of +1.27 and +0.63 over SFT and DPO, respectively. Additionally, TPO showed higher average accuracy, surpassing DPO and SFT by 4.2% and 4.97% on the Open LLM Leaderboard benchmarks. Our code is publicly available at https://github.com/sahsaeedi/triple-preference-optimization.

Corresponding author. * Equal contribution.
Figure 1: Comparison of the loss functions of TPO and DPO. TPO’s loss function incorporates two main objectives. Its first term optimizes the log probability of preferences ($\mathcal{L}_{\mathrm{preference}}(\pi_{\theta})$), which demonstrates that optimizing preferences doesn’t necessitate a reference model (See Section 3). Through its second term, TPO aims to learn the gold standard response ($\mathcal{L}_{\mathrm{reference}}$). This aspect of the loss function is regulated by a parameter $\alpha$, which controls the extent to which the policy model learns the gold standard response.

1 Introduction

Figure 2: (a) During the SFT step, a pre-trained model is fine-tuned to align with human expectations. (b) To further enhance the performance of the SFT model, we train it with human preferences using reinforcement learning. (c) Alternatively, we can directly align an SFT model with human preferences using RL-free methods such as DPO. (d) In TPO, we merge preference optimization with gold standard response learning, enabling direct fine-tuning of a pre-trained model based on three preferences.

LLMs are trained across a wide array of tasks, demonstrating their remarkable versatility in solving diverse tasks Brown et al. (2020); Narayanan et al. (2021); Bubeck et al. (2023). However, their training on data of varying quality can lead to many issues, such as the generation of toxic or harmful text under certain contexts Perez et al. (2022); Ganguli et al. (2022), and in general, the generation of outputs that are not desired by humans. Hence, it is crucial to align LLMs with human expectations and preferences that prioritize their helpfulness, honesty, and harmlessness Bai et al. (2022).

Supervised Fine-Tuning (SFT) is a direct alignment method that involves fitting a model to human-written data Sanh et al. (2022). However, this approach fails to fully impart the human perspective to the model. During training, the model only receives a reference response for each input, thus lacking exposure to incorrect answers and preferences, which ultimately constrains its performance on downstream tasks Touvron et al. (2023).

A prominent method in AI alignment for LLMs is Reinforcement Learning with Human Feedback (RLHF) Ouyang et al. (2022). Despite its impressive performance relative to SFT, RLHF faces limitations such as instability and susceptibility to reward hacking Liu et al. (2024). Consequently, a recent approach called Direct Preference Optimization (DPO) Rafailov et al. (2023) has emerged. DPO is an RL-free method that directly optimizes human preferences by shifting from RL to simple binary cross-entropy. However, DPO encounters several limitations: 1) high dependency on the SFT part Tunstall et al. (2023), 2) tendency to overfit beyond a single epoch Azar et al. (2023), and 3) inefficient learning and memory utilization Xu et al. (2024).

To address these limitations, various alignment methods have been proposed for dialogue systems Tunstall et al. (2023), harmful and helpfulness question answering Wu et al. (2023), summarization Zhao et al. (2023), and translation Xu et al. (2024); all of these studies include a separate SFT component. During SFT, models are fine-tuned to generate appropriate responses to the corresponding input prompts. Meanwhile, in DPO, models are fine-tuned to enhance the likelihood of generating preferred responses over less desirable ones while not straying far from the SFT model Rafailov et al. (2023).

In this paper, we introduce Triple Preference Optimization (TPO), a new preference learning approach. In TPO, we combine the two separate optimization steps (supervised fine-tuning and preference learning) into a single step based on the Pareto front concept Lotov and Miettinen (2008), with the training data having both the gold standard response (as in SFT) and the preferences (as in PPO/DPO) in a consolidated format. Thus, our training data will be of the form (input prompt, gold standard response $(y_{ref})$, preferred response $(y_{w})$, less-preferred response $(y_{l})$). Specifically, we jointly optimize a policy model with $-\mathbb{E}_{(x, y_{ref}) \sim \mathcal{D}}\left[\log \pi_{\theta}(y_{ref} \mid x)\right]$ and $-\mathbb{E}_{(x, y_{w}, y_{l}) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \pi_{\theta}(y_{w} \mid x) - \beta \log \pi_{\theta}(y_{l} \mid x)\right)\right]$ in one step (See Figure 1).

Our results show that TPO exhibits impressive performance compared to SFT across various benchmarks and outperforms other alignment methods such as DPO. Specifically, Mistral (7B), fine-tuned by TPO and trained with six times less data than other alignment techniques, outperforms SFT, DPO, KTO, IPO, CPO, and ORPO across nine benchmarks on the Open LLM Leaderboard. Notably, Mistral aligned with TPO achieved a +0.72 increase in the MT-Bench score over SFT.

Overall, TPO addresses two key shortcomings in alignment tasks. Firstly, by removing the reference model $\pi_{ref}$ (justified in Section 3), TPO mitigates the inefficient learning and memory utilization issues observed in DPO, IPO, and KTO, allowing for more computational efficiency with less memory usage. Secondly, TPO enhances performance over SFT and other alignment methods by maximizing the likelihood of the gold response, regularized by the parameter $\alpha$, while simultaneously optimizing between two preferences (preferred and less-preferred responses). Despite TPO’s need for three preferences and its higher cost relative to other methods, our findings reveal that it’s possible to considerably lessen the training data required and still achieve superior outcomes (See Table 1).

Our findings suggest that a separate SFT step is not necessary for TPO and, in certain scenarios, having one may even hinder TPO’s performance (See Tables 1 and 2).

We summarize our primary contributions as follows:

  1. We propose a new preference learning method called Triple Preference Optimization (TPO) that simplifies the alignment process and reduces two stages to one stage.

  2. Theoretically, we derive the TPO objective and show that combining the human expectation data and preference dataset achieves better performance.

  3. Comprehensive experiments reveal that the TPO method, applied to two distinct baseline models, Mistral (7B) and Phi-2 (2.7B), outperforms SFT, KTO, IPO, DPO, CPO, and ORPO across ten different benchmarks (refer to Tables 1, 2, and 3).

  4. Integrating the SFT step with the preference alignment step and moderating it with a regularization parameter ($\alpha$) enhances the model’s performance while reducing the data required for training (See Figure 3).

2 Related Works

The performance of Large Language Models (LLMs) on a variety of tasks is remarkable Anil et al. (2023). Nonetheless, effectively aligning LLMs remains a significant challenge. Current studies have fine-tuned LLMs using datasets of human preferences, leading to improvements in translation Kreutzer et al. (2018), summarization Stiennon et al. (2022), story-telling Ziegler et al. (2019), instruction-following Ramamurthy et al. (2023), and dialogue systems.

RLHF Christiano et al. (2023), introduced in the literature, aims to optimize for maximum reward by interacting with a reward model trained using the Bradley-Terry (BT) model Bong and Rinaldo (2022), typically through reinforcement algorithms like Proximal Policy Optimization (PPO) Schulman et al. (2017). While RLHF enhances model performance, it faces challenges such as instability, reward hacking, and scalability inherent in reinforcement learning. Recent works have presented techniques to overcome these challenges by optimizing relative preferences without relying on reinforcement learning. Utilizing the Bradley-Terry (BT) model to optimize a model on preference datasets is instrumental in ensuring alignment with human preferences.

SLiC Zhao et al. (2023) introduced a novel method for ranking preferences generated by a supervised fine-tuned (SFT) model, incorporating calibration loss and regularization fine-tuning loss during training. Meanwhile, RRHF Yuan et al. (2023) trains the SFT model using a zero-margin likelihood contrastive loss, assuming multiple ranked responses for each input. While both SLiC and RRHF are effective, they lack theoretical foundations. In contrast, DPO offers a method to directly fit an SFT model to human preferences using the Bradley-Terry (BT) model, providing theoretical insights into the alignment process.

RSO Liu et al. (2024) merges the techniques of SLiC and DPO while introducing an improved approach for collecting preference pairs through statistical rejection sampling. IPO Azar et al. (2023) has mathematically revealed the limitations of the DPO approach concerning overfitting and generalization. It proposes a comprehensive objective for learning from human preferences. Zephyr Tunstall et al. (2023) has improved DPO by utilizing the distillation method.

KTO Ethayarajh et al. (2023), drawing inspiration from Kahneman and Tversky’s influential work on prospect theory Tversky and Kahneman (1992), seeks to maximize the utility of LLM outputs directly rather than optimizing the log-likelihood of preferences. By prioritizing the determination of whether a preference is desirable or undesirable, this method eliminates the requirement for two preferences for the same input.

Recently, CPO Xu et al. (2024) introduced an efficient method for learning preferences by combining maximum-likelihood loss with the DPO loss function, aiming to improve memory usage and learning efficiency. Additionally, ORPO Hong et al. (2024) proposed a novel approach by incorporating a penalty term to prevent the learning of unpreferred responses while enhancing the likelihood of learning preferred responses.

We observe two primary challenges in the alignment process addressed by the aforementioned studies. Firstly, alignment methods such as DPO require an SFT part or have better performance with an SFT part. Secondly, there are concerns regarding inefficient learning and memory usage. While the CPO has proven to be an effective learning approach, a conflict between its objectives may restrict the policy model’s performance. In this research, we investigate these limitations and seek to introduce a new algorithm to address them.

3 Triple Preference Optimization

In this section, we introduce Triple Preference Optimization (TPO), a new approach to preference learning. This method optimizes a policy model ($\pi_{\theta}$) by maximizing the likelihood of the gold response and optimizing for the preferences simultaneously.

Typically, in NLP tasks, we utilize a dataset $D_{reference} = \{x^{i}, y_{ref}^{i}\}_{i=1}^{N}$, where $x$ is the input and $y_{ref}$ is the gold standard response, crafted by humans or large models like GPT-4 and validated by humans. Additionally, for applying preference optimization methods, a dataset $D_{preference} = \{x^{i}, y_{w}^{i}, y_{l}^{i}\}_{i=1}^{N}$ is needed, where $y_{w}$ and $y_{l}$ are the preferred and unpreferred responses respectively, generated by smaller models such as LLaMA-3. The aim of TPO is to optimize three preferences concurrently. To achieve this, we merge the reference and preference datasets into one dataset $D_{TPO} = \{x^{i}, y_{ref}^{i}, y_{w}^{i}, y_{l}^{i}\}_{i=1}^{N}$, establishing a response hierarchy of $y_{ref} \succ y_{w} \succ y_{l}$. Further details on the TPO objective will be discussed in the following subsection.
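The merge described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`prompt`, `reference`, `chosen`, `rejected`), the helper `build_tpo_dataset`, and the choice to join the two datasets on an identical prompt string are all hypothetical, since the text does not specify how the records are matched.

```python
from typing import Dict, List, TypedDict


class TPOExample(TypedDict):
    prompt: str     # x
    reference: str  # y_ref, gold standard response
    chosen: str     # y_w, preferred response
    rejected: str   # y_l, less-preferred response


def build_tpo_dataset(
    reference_data: List[Dict[str, str]],
    preference_data: List[Dict[str, str]],
) -> List[TPOExample]:
    """Join the two datasets on the prompt to form (x, y_ref, y_w, y_l) records."""
    gold_by_prompt = {row["prompt"]: row["reference"] for row in reference_data}
    merged: List[TPOExample] = []
    for row in preference_data:
        gold = gold_by_prompt.get(row["prompt"])
        if gold is None:
            continue  # assumption: drop prompts that have no gold response
        merged.append(
            TPOExample(
                prompt=row["prompt"],
                reference=gold,
                chosen=row["chosen"],
                rejected=row["rejected"],
            )
        )
    return merged


# Toy records, made up for illustration only.
reference_data = [{"prompt": "What is 2 + 2?", "reference": "2 + 2 = 4."}]
preference_data = [
    {"prompt": "What is 2 + 2?", "chosen": "It is 4.", "rejected": "It is 5."},
    {"prompt": "Unmatched prompt", "chosen": "a", "rejected": "b"},
]
tpo_data = build_tpo_dataset(reference_data, preference_data)
print(tpo_data)
```

Each resulting record carries the three responses for one prompt, which is the shape the TPO objective consumes.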

3.1 Deriving the TPO objective

Motivated by the goal of simplifying the alignment process to a single step and enhancing the learning mechanisms of DPO, we derive the TPO objective. We start with a simple RL objective for aligning an LLM, parameterized by $\theta$ and represented as $\pi_{\theta}$, with preferences. The RL objective is just maximizing the expected reward Ziegler et al. (2019) as shown in Equation 1:

$$
\max_{\pi_{\theta}} \left[\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y)\right]\right] \tag{1}
$$

where $r_{\phi}$ represents the expected reward that the model receives for a given input $x$ and output $y$. However, maximizing the reward without constraints can lead to distribution collapse in an LLM. Drawing inspiration from the Maximum Entropy Reinforcement Learning (MERL) framework Hejna et al. (2023), we have modified the RLHF objective, as detailed in Equation 4. The MERL framework aims to maximize causal entropy alongside the expected reward. This objective is formally defined in Equation 2.

$$
\max_{\pi_{\theta}} \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y)\right] + \beta \mathcal{H}_{\pi_{\theta}}(y \mid x)\right] \tag{2}
$$

By definition of entropy,

$$
\mathcal{H}_{\pi_{\theta}}(y \mid x) = -\sum_{y} \pi_{\theta}(y \mid x) \log \pi_{\theta}(y \mid x) \tag{3}
$$

The objective becomes,

$$
\max_{\pi_{\theta}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y) - \beta \log \pi_{\theta}(y \mid x)\right] \tag{4}
$$
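The step from Equation 2 to Equation 4 can be spelled out as follows. This is a sketch that only rewrites the entropy of Equation 3 as an expectation under $\pi_{\theta}$ and then merges the two expectations; it adds no assumption beyond those equations.

$$
\begin{aligned}
\beta\,\mathcal{H}_{\pi_{\theta}}(y \mid x)
  &= -\beta \sum_{y} \pi_{\theta}(y \mid x) \log \pi_{\theta}(y \mid x)
   = \mathbb{E}_{y \sim \pi_{\theta}(y \mid x)}\left[-\beta \log \pi_{\theta}(y \mid x)\right], \\
\mathbb{E}_{y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y)\right] + \beta\,\mathcal{H}_{\pi_{\theta}}(y \mid x)
  &= \mathbb{E}_{y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y) - \beta \log \pi_{\theta}(y \mid x)\right].
\end{aligned}
$$

Taking the outer expectation over $x \sim \mathcal{D}$ gives the objective in Equation 4.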

Based on this, the optimal policy model induced by a reward function $r(x, y)$ could be derived as shown in Equation 5 (See Appendix A.1). It takes the following form:

$$
\pi_{r}(y \mid x) = \frac{1}{Z(x)} \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{5}
$$

where $Z(x) = \sum_{y} \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the new partition function. Inspired by Rafailov et al. (2023), we show that the reward function, in terms of the optimal policy that it induces, is calculated as per Equation 6 given below:

$$
r(x, y) = \beta \log \pi_{r}(y \mid x) + \beta \log Z(x) \tag{6}
$$
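Equations 5 and 6 can be checked numerically on a toy example. The sketch below assumes a single prompt with three candidate responses and made-up reward values; the names `rewards`, `policy`, and `recovered` are illustrative only. It builds the policy from the rewards (Equation 5) and then recovers the rewards from the policy and the partition function (Equation 6).

```python
import math

beta = 0.5
# Hypothetical rewards r(x, y) for three candidate responses to one prompt x.
rewards = {"y_a": 1.0, "y_b": 0.2, "y_c": -0.7}

# Equation 5: pi_r(y|x) = exp(r(x, y) / beta) / Z(x)
Z = sum(math.exp(r / beta) for r in rewards.values())
policy = {y: math.exp(r / beta) / Z for y, r in rewards.items()}

# Equation 6: r(x, y) = beta * log pi_r(y|x) + beta * log Z(x)
recovered = {y: beta * math.log(p) + beta * math.log(Z) for y, p in policy.items()}

for y in rewards:
    print(y, round(policy[y], 4), round(recovered[y], 4))
```

The policy sums to one, and the recovered rewards match the originals up to floating-point error, which is the inversion that Equation 6 states.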

Subsequently, we can represent the ground-truth reward $r^{\ast}(x, y)$ in the form of its corresponding optimal policy $\pi^{\ast}$ that it induces.

Since the Bradley-Terry model is dependent only on the difference between the two reward functions, i.e., $p^{\ast}(y_{w} > y_{l} \mid x) = \sigma\left(r^{\ast}(x, y_{w}) - r^{\ast}(x, y_{l})\right)$, we can reparameterize it as follows in Equation 7:

$$
p^{\ast}(y_{w} > y_{l} \mid x) = \sigma\Big(\beta \log \pi^{\ast}(y_{w} \mid x) - \beta \log \pi^{\ast}(y_{l} \mid x)\Big) \tag{7}
$$
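The reparameterization follows because the partition function cancels in the reward difference. Substituting Equation 6 (written for $r^{\ast}$ and $\pi^{\ast}$) into the Bradley-Terry argument gives the following worked step:

$$
\begin{aligned}
r^{\ast}(x, y_{w}) - r^{\ast}(x, y_{l})
  &= \big[\beta \log \pi^{\ast}(y_{w} \mid x) + \beta \log Z(x)\big] - \big[\beta \log \pi^{\ast}(y_{l} \mid x) + \beta \log Z(x)\big] \\
  &= \beta \log \pi^{\ast}(y_{w} \mid x) - \beta \log \pi^{\ast}(y_{l} \mid x).
\end{aligned}
$$

Because $Z(x)$ depends only on the prompt $x$, it never has to be computed when comparing two responses to the same prompt.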

Similar to the reward modeling approach, we model the human preferences, now in terms of a parameterized policy $\pi_{\theta}$. Thus, we formulate a maximum-likelihood objective (preference objective) for a dataset $D = \{x^{i}, y_{w}^{i}, y_{l}^{i}\}_{i=1}^{N}$ as outlined in Equation 8:

$$
\mathcal{L}_{\mathrm{preference}}(\pi_{\theta}) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \pi_{\theta}(y_{w} \mid x) - \beta \log \pi_{\theta}(y_{l} \mid x)\Big)\Big] \tag{8}
$$

Looking at Equation 8, the objective is fitting a reward which is reparameterized as $r(x, y) = \beta \log \pi(y \mid x)$. In Section 3.2, we theoretically explain that fitting this reward would ultimately recover the optimal policy.

The comparison between the loss function in Equation 8 and the DPO loss function indicates that the new function is more efficient because it requires only one model during training. However, even though maximizing the objective under the MERL setting prevents distribution collapse, it trains a pessimistic model, which also keeps the model from learning the preferred responses effectively. To counteract this limitation, we maximize the likelihood of the gold response. The adjustment is specified in Equation 9.

$$
\mathcal{L}_{\mathrm{reference}} = -\mathbb{E}_{(x, y_{ref}) \sim \mathcal{D}}\left[\log \pi_{\theta}(y_{ref} \mid x)\right] \tag{9}
$$

Based on Equations 8 and 9, TPO is defined as a multi-objective (bi-objective) optimization problem, as supported by the Pareto front concept Lotov and Miettinen (2008). The TPO loss function is framed as follows:

$$
\mathcal{L}_{\mathrm{TPO}} = \mathcal{L}_{\mathrm{preference}} + \alpha \mathcal{L}_{\mathrm{reference}} \tag{10}
$$

where the hyper-parameter $\alpha$ plays a crucial role in moderating the model’s learning of the gold response. The impact of $\alpha$ on the model’s performance is detailed in Section 4.3.
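To make Equations 8 to 10 concrete, the following is a minimal sketch of the loss in plain Python. It is not the released implementation: the function names `log_sigmoid` and `tpo_loss` are hypothetical, the inputs are assumed to be sequence-level log-probabilities $\log \pi_{\theta}(y \mid x)$ (for example, summed token log-probabilities, which is itself an assumption), and the expectation is approximated by a batch mean.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def tpo_loss(
    logp_ref: Sequence[float],
    logp_w: Sequence[float],
    logp_l: Sequence[float],
    alpha: float = 1.0,
    beta: float = 0.1,
) -> float:
    """Batch mean of L_preference + alpha * L_reference (Equations 8-10).

    Each argument holds one sequence-level log-probability per example
    for y_ref, y_w, and y_l respectively.
    """
    n = len(logp_ref)
    # Equation 8: -log sigma(beta * log pi(y_w|x) - beta * log pi(y_l|x))
    preference = -sum(
        log_sigmoid(beta * (w - l)) for w, l in zip(logp_w, logp_l)
    ) / n
    # Equation 9: -log pi(y_ref|x)
    reference = -sum(logp_ref) / n
    # Equation 10: weighted sum of the two objectives
    return preference + alpha * reference


# Toy log-probabilities for a batch of two examples (made-up numbers).
loss = tpo_loss(
    logp_ref=[-12.0, -9.5],
    logp_w=[-14.0, -11.0],
    logp_l=[-13.0, -15.0],
)
print(loss)
```

Note that no reference model appears anywhere in the computation: only log-probabilities from the policy being trained are needed, which is the efficiency argument made above.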

Insights into the TPO update.

A deeper mechanistic understanding of TPO can be achieved by analyzing the gradient of the TPOsubscriptTPO\mathcal{L}_{\mathrm{TPO}}caligraphic_L start_POSTSUBSCRIPT roman_TPO end_POSTSUBSCRIPT loss function. The expression of this gradient in relation to the parameters θ𝜃\thetaitalic_θ is as follows:

θTPO=subscript𝜃subscriptTPOabsent\displaystyle\nabla_{\theta}\mathcal{L}_{\text{TPO}}=∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT TPO end_POSTSUBSCRIPT = 𝔼(x,yref,yw,yl)𝒟[αθlogπ(yref|x)increase likelihood of yref\displaystyle-\mathds{E}_{(x,y_{ref},y_{w},y_{l})\sim\mathcal{D}}\;[% \underbrace{\alpha\nabla_{\theta}\log\pi(y_{ref}|x)}_{\text{increase % likelihood of $y_{ref}$}}- blackboard_E start_POSTSUBSCRIPT ( italic_x , italic_y start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ∼ caligraphic_D end_POSTSUBSCRIPT [ under⏟ start_ARG italic_α ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_log italic_π ( italic_y start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT | italic_x ) end_ARG start_POSTSUBSCRIPT increase likelihood of italic_y start_POSTSUBSCRIPT italic_r italic_e italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT
+βσ(βlogπθ(yl|x)βlogπθ(yw|x)increase weight when reward estimate is wrong)𝛽𝜎subscript𝛽subscript𝜋𝜃conditionalsubscript𝑦𝑙𝑥𝛽subscript𝜋𝜃conditionalsubscript𝑦𝑤𝑥increase weight when reward estimate is wrong\displaystyle+\beta\sigma(\underbrace{\beta\log\pi_{\theta}(y_{l}|x)-\beta\log% \pi_{\theta}(y_{w}|x)}_{\text{increase weight when reward estimate is wrong}})+ italic_β italic_σ ( under⏟ start_ARG italic_β roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) - italic_β roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG start_POSTSUBSCRIPT increase weight when reward estimate is wrong end_POSTSUBSCRIPT )
×[θlogπ(yw|x)increase likelihood of ywθlogπ(yl|x)decrease likelihood of yl]]\displaystyle\times[\underbrace{\nabla_{\theta}\log\pi(y_{w}|x)}_{\text{% increase likelihood of $y_{w}$}}-\underbrace{\nabla_{\theta}\log\pi(y_{l}|x)}_% {\text{decrease likelihood of $y_{l}$}}]]× [ under⏟ start_ARG ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_log italic_π ( italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT | italic_x ) end_ARG start_POSTSUBSCRIPT increase likelihood of italic_y start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT end_POSTSUBSCRIPT - under⏟ start_ARG ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_log italic_π ( italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT | italic_x ) end_ARG start_POSTSUBSCRIPT decrease likelihood of italic_y start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] ] (11)

where $r(x,y)=\beta\log\pi_{\theta}(y\mid x)$ is the reward implicitly defined by the policy model $\pi_{\theta}$. Intuitively, the gradient of the TPO loss increases the likelihood of the gold completions $y_{ref}$ while also sharpening the preference signal: it raises the likelihood of the preferred completions $y_w$ and lowers the likelihood of the less-preferred completions $y_l$, with each example weighted by how incorrectly the implicit reward model orders the pair (more details in Appendix A.2). Notably, the hyper-parameters $\beta$ and $\alpha$ significantly influence the performance of the policy model, as discussed further in Section 4.3.
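As a minimal sketch of what Equation 11 expresses, the snippet below writes a per-example loss whose gradient has that form, assuming the loss is $-[\alpha\log\pi(y_{ref}|x) + \log\sigma(\beta\log\pi(y_w|x) - \beta\log\pi(y_l|x))]$. The function names `tpo_loss` and `tpo_grad_coefficients` are hypothetical, and sequence log-probabilities are passed in as plain floats rather than computed from a model.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def tpo_loss(logp_ref, logp_w, logp_l, alpha=1.0, beta=0.1):
    """Per-example loss whose gradient matches the form of Equation 11 (assumed form)."""
    preference = math.log(sigmoid(beta * logp_w - beta * logp_l))
    reference = alpha * logp_ref
    return -(reference + preference)


def tpo_grad_coefficients(logp_ref, logp_w, logp_l, alpha=1.0, beta=0.1):
    """Coefficients multiplying grad log pi(y_ref), grad log pi(y_w), grad log pi(y_l)
    in the gradient of tpo_loss."""
    # The weight grows when the implicit reward ranks y_l above y_w.
    weight = beta * sigmoid(beta * logp_l - beta * logp_w)
    return (-alpha, -weight, weight)


# Illustrative log-probabilities: the pair is mis-ordered in the first call.
misordered = tpo_grad_coefficients(-12.0, -30.0, -20.0)
well_ordered = tpo_grad_coefficients(-12.0, -20.0, -30.0)
print(misordered[2] > well_ordered[2])
```

The coefficient on $\nabla_{\theta}\log\pi(y_{ref}|x)$ is constant, while the preference coefficients shrink as the policy orders $y_w$ and $y_l$ correctly, which is the weighting described above.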

| Model | Align | ARC | TruthfulQA | Winogrande | HellaSwag | MMLU | Average |
|---|---|---|---|---|---|---|---|
| Mistral | SFT | 60.41 | 43.73 | 74.19 | 81.69 | 60.92 | 64.18 |
| Mistral+SFT | DPO | 59.04 | 46.70 | 76.63 | 82.10 | 60 | 64.91 |
| Mistral+SFT | IPO | 59.30 | 42.22 | 76.4 | 81.02 | 59.93 | 63.77 |
| Mistral+SFT | KTO | 57.84 | 49.88 | 76.47 | 81.61 | 59.73 | 65.1 |
| Mistral+SFT | CPO | 57.50 | 53.22 | 75.92 | 80.37 | 58.41 | 65.08 |
| Mistral | ORPO | 58.61 | 52.77 | 77.5 | 82.04 | 63.26 | 66.83 |
| Mistral+SFT | TPO (ours) | 58.02 | 59.05 | 76.47 | 80.6 | 59.48 | 66.72 |
| Mistral | TPO (ours, $\alpha=1$, $\beta=0.1$) | 61.34 | 60 | 78.21 | 83.18 | 63.18 | 69.18 |
| Mistral | TPO (ours, $\alpha=0.9$, $\beta=0.2$) | 60.23 | 57.34 | 78.29 | 83.01 | 63.75 | 68.52 |
Table 1: Comparing TPO’s performance with other alignment methods reveals that the Mistral+TPO model exhibits comparable performance across different benchmarks and, on average, outperforms other methods. In particular, Mistral+TPO performed remarkably on the TruthfulQA benchmark. It’s worth noting that the Mistral+TPO model is directly trained with TPO, which contributes to its superior performance. Additionally, for all benchmarks, accuracy is the metric used to gauge performance.

3.2 Theory behind TPO

In this section, we provide a theoretical foundation for the TPO algorithm, drawing inspiration from Rafailov et al. (2023). We observe that the preference optimization objective aligns with the principles of a Bradley-Terry model, where the reward parameterization is defined as $r(x,y)=\beta\log\pi_{\theta}(y|x)$. Consequently, we optimize our parametric model $\pi_{\theta}$ in a manner similar to reward model optimization, as shown by Ouyang et al. (2022). We expand on the theory underlying this reparameterization of the reward function, illustrating that it does not constrain the range of reward models that can be modeled and ensures accurate retrieval of the optimal policy. We begin by following the insights presented in DPO about equivalence classes of reward models.

Definition 3.1

Two reward functions $r(x,y)$ and $r'(x,y)$ are equivalent iff $r(x,y)-r'(x,y)=g(x)$ for some function $g$.

This definition is an equivalence relation, so it partitions the set of reward functions into distinct classes, and we can state the following two lemmas.

Lemma 3.1

Under the Plackett-Luce, and in particular the Bradley-Terry preference framework, two reward functions from the same class induce the same preference distribution. Rafailov et al. (2023)

Lemma 3.2

Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. Rafailov et al. (2023)

The proofs are shown in Appendix A.3.
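To make Lemma 3.1 concrete for the Bradley-Terry case, the following sketch checks numerically that adding a prompt-dependent offset $g(x)$ to both rewards leaves the preference probability unchanged, because only the difference $r(x,y_w)-r(x,y_l)$ enters the model. The reward values and the offset are illustrative, and the helper names are hypothetical.

```python
import math


def bradley_terry(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


def shifted(reward: dict, g_x: float) -> dict:
    """Return an equivalent reward r'(x, y) = r(x, y) + g(x) for a fixed prompt x."""
    return {y: r + g_x for y, r in reward.items()}


rewards = {"y_w": 1.3, "y_l": -0.4}        # illustrative values for one prompt
rewards_prime = shifted(rewards, g_x=5.0)  # same equivalence class

p = bradley_terry(rewards["y_w"], rewards["y_l"])
p_prime = bradley_terry(rewards_prime["y_w"], rewards_prime["y_l"])
print(math.isclose(p, p_prime))
```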

Theorem 3.1

Under mild assumptions, all reward classes consistent with Plackett-Luce models can be represented with the reparameterization $r(x,y)=\beta\log\pi(y|x)$ for some model $\pi(y|x)$. Rafailov et al. (2023)

As proposed in DPO, by imposing certain constraints on the under-constrained Plackett-Luce family of preference models that preserve the class of representable reward models, it is possible to make the optimal policy in Equation 5 analytically tractable for all prompts $x$. The theorem is elaborated in Appendix A.4. We next lay out the theoretical basis for defining and optimally addressing the TPO objective within a multi-objective optimization framework.
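Returning to Theorem 3.1, the reparameterization can be illustrated on a toy problem. The sketch below assumes a finite response set for a single prompt and builds $\pi(y|x)\propto\exp(r(x,y)/\beta)$; the reward $\beta\log\pi(y|x)$ recovered from that policy then differs from the original reward only by a constant that depends on the prompt, so both lie in the same equivalence class. The reward values and function names are hypothetical, and the theorem's "mild assumptions" are not checked here.

```python
import math


def policy_from_reward(rewards, beta):
    """Softmax policy pi(y|x) proportional to exp(r(x, y) / beta) over a finite response set."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]


def implied_reward(policy, beta):
    """Reward induced by the reparameterization r(x, y) = beta * log pi(y|x)."""
    return [beta * math.log(p) for p in policy]


rewards = [2.0, 0.5, -1.0]  # illustrative rewards for three responses to one prompt
beta = 0.1
pi = policy_from_reward(rewards, beta)
r_hat = implied_reward(pi, beta)

# The gap r - r_hat is the same for every response, i.e. a function g(x) of the prompt only.
offsets = [r - rh for r, rh in zip(rewards, r_hat)]
print(max(offsets) - min(offsets) < 1e-9)
```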

Definition 3.2

Let $f_i$ denote the $i^{th}$ objective and $\mathcal{S}$ denote the feasible policy space. Then, in a multi-objective optimization setting, a policy $\pi^{\ast}\in\mathcal{S}$ is said to be Pareto optimal if there does not exist another policy $\pi\in\mathcal{S}$ such that $f_i(\pi)\leq f_i(\pi^{\ast})$ for all $i=1,\dots,k$ and $f_j(\pi)<f_j(\pi^{\ast})$ for at least one index $j$.

Looking at the objectives in Equation 8 and Equation 9, it is clear that optimizing them together is non-trivial; that is, there does not exist a single policy that is optimal with respect to both objectives. The objectives are at least partly conflicting, especially when $y_{ref}\sim y_w$, since one objective maximizes the log probability while the other minimizes it. For a multi-objective problem, Miettinen (1999) shows that when one objective is optimized and the remaining objectives are converted into constraints with upper bounds, the solution to this $\epsilon$-constrained problem is Pareto optimal. This shows that optimizing the TPO objective, which is a bi-objective problem, gives a policy that is Pareto optimal in the sense of Definition 3.2.
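The dominance test in Definition 3.2 can be sketched as follows, using the minimization convention of the definition. The objective values assigned to the four candidate policies are hypothetical and serve only to show that conflicting objectives leave several non-dominated trade-offs rather than a single best policy.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_optimal(candidates: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (f_1, f_2) values for four candidate policies.
candidates = [(0.2, 0.9), (0.4, 0.5), (0.5, 0.6), (0.9, 0.1)]
front = pareto_optimal(candidates)
print(front)
```

Here the third candidate is dominated by the second, while the remaining three are mutually non-dominated.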

| Model | Align | MT-Bench | BB-causal | BB-sports | BB-formal | OpenBookQA |
|---|---|---|---|---|---|---|
| Mistral | SFT | 5.94 | 51.57 | 61.76 | 51.4 | 43.8 |
| Mistral+SFT | CPO | 6.2 | 49.47 | 70.68 | 51.07 | 44.6 |
| Mistral+SFT | DPO | 6.64 | 52.1 | 71.9 | 51 | 46.2 |
| Mistral+SFT | IPO | 6.43 | 51.57 | 65.01 | 51.22 | 44.6 |
| Mistral+SFT | KTO | 6.48 | 53.68 | 73.42 | 51.33 | 45.8 |
| Mistral | ORPO | 5.47 | 54.21 | 73.93 | 50.4 | 44.4 |
| Mistral+SFT | TPO (ours) | 6.66 | 54.21 | 73.93 | 50.84 | 45.6 |
| Mistral | TPO (ours, $\alpha=1$, $\beta=0.1$) | 6.22 | 55.26 | 73.63 | 51.06 | 48.2 |
| Mistral | TPO (ours, $\alpha=0.9$, $\beta=0.2$) | 6.66 | 56.31 | 73.32 | 50.5 | 47.8 |
Table 2: Comparison of TPO with other alignment methods on additional benchmarks. Mistral+SFT+TPO and Mistral+TPO emerge as the top performers, matching or surpassing other methods on MT-Bench, BB-causal, BB-sports, and OpenBookQA. For BB-causal, BB-sports, BB-formal, and OpenBookQA, performance is measured by accuracy, while MT-Bench uses a score generated by GPT-4 that ranges from 0 to 10.
Figure 3: This figure displays the performance of Mistral+TPO across various settings of $\alpha$ and $\beta$. In several configurations, Mistral+TPO outperforms SFT on the Open LLM Leaderboard benchmarks. Further discussion is provided in Section 4.3.

4 Experiments and Results

In this section, we present a comprehensive empirical analysis of TPO, yielding several key findings: 1) Phi-2+TPO and Mistral+TPO trained on 10K data outperform Phi-2+SFT and Mistral+SFT trained on 200K data by 12.7% and 7.2% on MT-Bench respectively. 2) Phi-2 fine-tuned with TPO surpasses the performance of models aligned with other methods on the MT-Bench. 3) Similarly, Mistral fine-tuned with TPO exceeds the performance of other alignment techniques across the majority of Open LLM Benchmarks. 4) Within the TPO method, the hyper-parameters $\alpha$ and $\beta$ play a critical role in influencing performance outcomes. 5) An ablation study focusing on batch size adjustments reveals that enlarging the batch size leads to improved performance for models optimized with TPO.

4.1 Experimental Setup

Models.

All experiments were conducted using zephyr-sft-full and Mistral-7B-v0.1 as the Mistral (7B) models, together with Phi-2 (2.7B) Javaheripi et al. (2023). We used the Transformer Reinforcement Learning (TRL) library von Werra et al. (2020) for fine-tuning. The notation "+" indicates that a model has been fine-tuned with a specific algorithm, such as "+TPO". Further training details for each method are in Appendix B.

Datasets.

In this study, we employ two dialogue datasets: 1) UltraChat Ding et al. (2023) and 2) UltraFeedback Cui et al. (2023). UltraChat comprises 200k examples generated by GPT-3.5-TURBO across 30 topics and 20 text material types, offering a high-quality dataset utilized for training the SFT model. Meanwhile, UltraFeedback consists of a 64K set of responses generated by state-of-the-art models such as LLaMA-2 evaluated by a teacher model such as GPT-4. To train TPO, which requires three preferences, we create a custom dataset from the UltraFeedback dataset. Here, the response with the highest score serves as the reference response, the second-highest score as the chosen response, and the lowest score as the rejected response. In light of findings from Saeidi et al. (2024), which indicate that alignment methods perform better with smaller training sets on one epoch, and due to computational limitations, we restrict our analysis to 12K (10K for training and 2K for evaluation) data points, randomly selected from the custom UltraFeedback dataset (More details in Appendix B).
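The triple construction described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields `response` and `score`, the helper names `build_triple` and `sample_split`, and the random seed are hypothetical and do not reflect the actual UltraFeedback schema or the authors' preprocessing code; ties in score are broken by input order.

```python
import random
from typing import Dict, List, Optional


def build_triple(prompt: str, completions: List[Dict]) -> Optional[Dict[str, str]]:
    """Map scored completions to (reference, chosen, rejected) by descending score."""
    if len(completions) < 3:
        return None  # three distinct responses are needed per prompt
    ranked = sorted(completions, key=lambda c: c["score"], reverse=True)
    return {
        "prompt": prompt,
        "reference": ranked[0]["response"],  # highest score
        "chosen": ranked[1]["response"],     # second-highest score
        "rejected": ranked[-1]["response"],  # lowest score
    }


def sample_split(triples: List, n_train: int = 10_000, n_eval: int = 2_000, seed: int = 0):
    """Randomly pick n_train + n_eval triples and split them into train and eval sets."""
    rng = random.Random(seed)
    picked = rng.sample(triples, n_train + n_eval)
    return picked[:n_train], picked[n_train:]


completions = [
    {"response": "A", "score": 4.5},
    {"response": "B", "score": 3.0},
    {"response": "C", "score": 1.5},
    {"response": "D", "score": 4.0},
]
triple = build_triple("What is 2 + 2?", completions)
print(triple)
```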

| Model / Alignment method | +SFT | +SFT+DPO | +SFT+IPO | +SFT+KTO | +SFT+CPO | +ORPO | +TPO |
|---|---|---|---|---|---|---|---|
| Phi-2 | 5.42 | 6.06 | 5.91 | 6.64 | 6.42 | 6.06 | 6.69 |
Table 3: The comparison of Phi-2’s performance when aligned with various methods on MT-Bench shows that Phi-2+TPO surpasses other alignment techniques.

Evaluation.

We evaluate our models in both single-turn and multi-turn scenarios using the MT-Bench benchmark Ding et al. (2023). MT-Bench is composed of 160 questions (80 two-turn conversations) covering eight different knowledge domains, designed to be evaluated by GPT-4. For a comprehensive evaluation, we assess all alignment methods on five Open LLM Leaderboard benchmarks: ARC Clark et al. (2018), HellaSwag Zellers et al. (2019), MMLU Hendrycks et al. (2021), TruthfulQA Lin et al. (2022), and Winogrande Sakaguchi et al. (2019). We further evaluate the models on four additional benchmarks: three from Big Bench bench authors (2023), namely Causal Judgment (causal reasoning), Sports Understanding (commonsense reasoning), and Formal Fallacies, together with OpenBookQA Mihaylov et al. (2018).

4.2 Demonstration of TPO Performance

We evaluate the TPO approach against other alignment techniques, such as KTO, IPO, CPO, DPO, and ORPO, using MT-Bench and the Open LLM Leaderboard Benchmarks. Our comparison involves two distinct model configurations: 1) aligning an SFT model using TPO and various other alignment methods, and 2) applying TPO directly to fine-tune a pre-trained model. Across all alignment approaches, we used Phi-2 (2.7B) and Mistral (7B) as the base models (more details in Appendix B).

MT-Bench.

Table 3 shows that Phi-2+TPO outperforms the other alignment techniques, raising the MT-Bench score by 1.27 points over Phi-2+SFT and 0.63 points over Phi-2+SFT+DPO (12.7% and 6.3% of the 10-point scale, respectively). Remarkably, Phi-2+TPO achieves this even when trained on just 10K data, in stark contrast to Phi-2+SFT's training on 200K data. Additionally, the results in Table 2 show that Mistral+TPO surpasses competing alignment methods in MT-Bench scores. Mistral+TPO trained on 10K data improves on Mistral+SFT, which is trained on 200K data, by 0.72 points (7.2% of the 10-point scale).

The results in Tables 2 and 5 demonstrate that TPO exceeds the performance of other alignment methods, in spite of the SFT step being skipped (see Appendix C.1). Furthermore, additional experiments show that TPO improves over DPO, KTO, IPO, and CPO by 13.3%, 13.6%, 2.5%, and 13.3%, respectively, when applied to an SFT model trained on 10K data (see Appendix C.2).

Open LLM Leaderboard Benchmarks.

The primary findings, detailed in Table 1, show that Mistral+SFT+TPO achieves the highest average accuracy among the methods applied on top of the SFT model. This is largely attributable to its strong result on TruthfulQA, since it lags behind Mistral+SFT+DPO on the other four benchmarks. An intriguing observation is that Mistral+TPO not only excels on average but also leads in performance across all benchmarks, showcasing the effectiveness of the TPO strategy. Specifically, Mistral+TPO achieved average accuracy improvements over Mistral+SFT, Mistral+SFT+DPO, Mistral+SFT+IPO, Mistral+SFT+KTO, Mistral+SFT+CPO, and Mistral+ORPO of 4.97%, 4.27%, 5.37%, 4.07%, 4.07%, and 2.35%, respectively. Additional results are given in Appendix D.

Figure 4: The MT-Bench score for various $\alpha$ and $\beta$ settings in Mistral+TPO illustrates the influence of $\alpha$ on performance.

Exploration on More Benchmarks.

For a comprehensive evaluation, we assessed the efficacy of the TPO method against various alignment strategies across different benchmarks: BB-causal, BB-sports, BB-formal, and OpenBookQA. As detailed in Table 2, Mistral+SFT+TPO exhibited superior performance on the BB-causal and BB-sports benchmarks, while it showed less impressive results on BB-formal and OpenBookQA. Notably, Mistral+TPO not only improved on Mistral+SFT+TPO's results on BB-causal and OpenBookQA but also surpassed Mistral+SFT, Mistral+SFT+DPO, Mistral+SFT+IPO, Mistral+SFT+KTO, Mistral+SFT+CPO, and Mistral+ORPO in accuracy by 4.81%, 1.71%, 3.91%, 1.01%, 3.01%, and 1.3%, respectively. Additional results can be found in Appendix D.

4.3 Ablation Studies

In this subsection, we delve into the impact of $\alpha$ and $\beta$ values, batch size, and learning rate on the performance of the TPO method. Central to our exploration is the TPO method's ability to bypass the SFT stage, thereby assessing its efficacy without this component. Our evaluation focuses on the MT-Bench score and the Open LLM Leaderboard benchmarks to gauge the models' performance.

Impact of $\alpha$ and $\beta$.

$\alpha$ and $\beta$ are crucial hyper-parameters: $\alpha$ controls how strongly the policy learns the gold response, while $\beta$ scales the preference-learning term. Figure 4 illustrates that the Mistral+TPO model with $\alpha=0.9$ and $\beta=0.2$ outperforms the other settings on MT-Bench. Additionally, Figure 3 highlights that Mistral+TPO notably excels on the Open LLM Leaderboard benchmarks, with an average accuracy increase of 5.12% over the SFT method.

Other hyper-parameters.

We extend our analysis to examine the influence of other hyper-parameters on TPO's efficacy, including the number of epochs, the learning rate, and the batch size, specifically with the Mistral+TPO model. We found that the learning rate is particularly critical when dealing with smaller datasets; a change of two orders of magnitude prevented the model from converging. Additionally, while batch size does affect performance, there is a threshold beyond which performance plateaus and no longer benefits from increases. Interestingly, we observed that Mistral+TPO, when trained on 10K data, tends to overfit after just one epoch, with additional epochs failing to enhance performance. Nonetheless, we hypothesize that with larger datasets performance would continue to improve beyond the initial epoch, as detailed further in Appendix E.

5 Conclusions

In this paper, we begin by addressing the limitations inherent in existing alignment methods. Typically, alignment techniques require an SFT component to achieve notable results. However, incorporating SFT introduces two primary challenges. First, fine-tuning a model using SFT demands a substantial dataset (for example, completing a chat task may require fine-tuning with 200K data points). Second, generating a preference dataset by sampling from the SFT model poses additional difficulties, including determining the optimal configuration for producing preferred and less preferred responses. To mitigate these shortcomings, we introduce TPO, a new alignment approach aimed at concurrently optimizing for human preferences and gold responses. Our findings demonstrate the impressive performance of TPO compared to other alignment methods on ten benchmarks. In particular, Mistral and Phi-2 fine-tuned by TPO achieve increases in the MT-Bench score of +0.72 and +1.27, respectively, compared to SFT, despite being trained on a dataset six times smaller. Another intriguing insight is the significant influence that the values of $\alpha$ and $\beta$ have on the model's performance.

6 Limitations and Future Works

While TPO has demonstrated impressive performance compared to other alignment methods across various benchmarks, the requirement to prepare three preferences for each input in a dataset poses challenges. In this section, we outline potential directions for future work. Our evaluation of TPO focused on chat completion tasks, but we are particularly interested in examining its effectiveness in other areas, such as safety and reasoning. Another intriguing aspect for further study is how the quality of reference and preferred responses affects TPO's performance. Notably, our current findings suggest that the reference response is generally better than the preferred response. Investigating whether increasing the preferential difference between these responses enhances performance could yield valuable insights. Additionally, we are interested in exploring TPO's effectiveness on larger models, such as those with 30B or 70B parameters, which represents a promising avenue for future work. Drawing inspiration from the new method proposed in Chatterjee et al. (2024) for fine-tuning diffusion models, we are keen to investigate how these models perform when aligned using the TPO method.

Acknowledgements

We thank the anonymous reviewers for constructive suggestions and the Research Computing (RC) at Arizona State University (ASU) for providing computing resources for experiments. We acknowledge support by a 2023 Spring Amazon Research Award (ARA).

References

  • Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. Palm 2 technical report.
  • Azar et al. (2023) Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. 2023. A general theoretical paradigm to understand learning from human preferences.
  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback.
  • bench authors (2023) BIG bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  • Bong and Rinaldo (2022) Heejong Bong and Alessandro Rinaldo. 2022. Generalized results for the existence and consistency of the mle in the bradley-terry-luce model.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4.
  • Chatterjee et al. (2024) Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, et al. 2024. Getting it right: Improving spatial consistency in text-to-image models. arXiv preprint arXiv:2404.01197.
  • Christiano et al. (2023) Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep reinforcement learning from human preferences.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge.
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback.
  • Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations.
  • Ethayarajh et al. (2023) Kawin Ethayarajh, Winnie Xu, Dan Jurafsky, and Douwe Kiela. 2023. Human-aware loss functions (halos). Technical report, Contextual AI.
  • Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, and Jack Clark. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.
  • Hejna et al. (2023) Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W Bradley Knox, and Dorsa Sadigh. 2023. Contrastive preference learning: Learning from human feedback without rl. arXiv preprint arXiv:2310.13639.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding.
  • Hong et al. (2024) Jiwoo Hong, Noah Lee, and James Thorne. 2024. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models.
  • Javaheripi et al. (2023) Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog.
  • Kreutzer et al. (2018) Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. 2018. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning.
  • Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods.
  • Liu et al. (2024) Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. 2024. Statistical rejection sampling improves preference optimization.
  • Lotov and Miettinen (2008) Alexander V. Lotov and Kaisa Miettinen. 2008. Visualizing the Pareto Frontier, pages 213–243. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Miettinen (1999) Kaisa Miettinen. 1999. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP.
  • Narayanan et al. (2021) Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on gpu clusters using megatron-lm.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
  • Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.
  • Ramamurthy et al. (2023) Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. 2023. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization.
  • Saeidi et al. (2024) Amir Saeidi, Shivanshu Verma, and Chitta Baral. 2024. Insights into alignment: Evaluating dpo and its variants across multiple tasks. arXiv preprint arXiv:2404.14723.
  • Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale.
  • Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms.
  • Stiennon et al. (2022) Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2022. Learning to summarize from human feedback.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
  • Tunstall et al. (2023) Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. 2023. Zephyr: Direct distillation of lm alignment.
  • Tversky and Kahneman (1992) Amos Tversky and Daniel Kahneman. 1992. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and uncertainty, 5:297–323.
  • von Werra et al. (2020) Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl.
  • Wu et al. (2023) Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, and Jiantao Jiao. 2023. Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment.
  • Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. 2024. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417.
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to align language models with human feedback without tears.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?
  • Zhao et al. (2023) Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. 2023. Slic-hf: Sequence likelihood calibration with human feedback.
  • Ziegler et al. (2019) Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. ArXiv, abs/1909.08593.

Appendix

Appendix A Derivation

A.1 Deriving the optimal policy under the Preference Objective

In this section, we derive the optimal policy obtained by optimizing the objective in Equation 4. For a given prompt $x$, the objective can be written equivalently as:

$$\max_{\pi}\;\mathbb{E}_{y\sim\pi(y|x)}\left[r(x,y)-\beta\log\pi(y|x)\right]\quad\text{s.t.}\quad\sum_{y}\pi(y|x)=1$$

Next, we form the Lagrangian of the above objective, with $\lambda$ as the Lagrange multiplier:

$$\mathcal{L}=\sum_{y}\pi(y|x)\,r(x,y)-\beta\bigg[\sum_{y}\pi(y|x)\log\pi(y|x)\bigg]-\lambda\bigg[\sum_{y}\pi(y|x)-1\bigg]$$

Differentiating $\mathcal{L}$ with respect to $\pi(y|x)$ gives

$$\frac{\partial\mathcal{L}}{\partial\pi(y|x)}=r(x,y)-\beta\big[\log\pi(y|x)+1\big]-\lambda$$

To obtain the optimal policy, we set the above derivative to zero and solve for $\pi(y|x)$:

$$\begin{aligned}
r(x,y)-\beta\big[\log\pi(y|x)+1\big]-\lambda &= 0\\
\log\pi(y|x) &= \frac{1}{\beta}r(x,y)-\frac{\lambda}{\beta}-1\\
\pi(y|x) &= \exp\Big(\frac{1}{\beta}r(x,y)\Big)\cdot\exp\Big(-\frac{\lambda}{\beta}-1\Big)
\end{aligned}$$

Since $\sum_{y}\pi(y|x)=1$, the second exponential factor acts as the normalizing constant, that is, the reciprocal of the partition function, as shown below:

$$\begin{aligned}
\bigg[\sum_{y}\exp\Big(\frac{1}{\beta}r(x,y)\Big)\bigg]\cdot\exp\Big(-\frac{\lambda}{\beta}-1\Big) &= 1\\
\exp\Big(-\frac{\lambda}{\beta}-1\Big) &= \bigg[\sum_{y}\exp\Big(\frac{1}{\beta}r(x,y)\Big)\bigg]^{-1}
\end{aligned}$$

Hence, the partition function is $Z(x)=\sum_{y}\exp\big(\frac{1}{\beta}r(x,y)\big)$, and the optimal policy $\pi_{r}(y|x)$ induced by the reward function $r(x,y)$ is therefore given by

$$\pi_{r}(y|x)=\frac{1}{Z(x)}\exp\Big(\frac{1}{\beta}r(x,y)\Big) \tag{1}$$

Now, we can express the reward function in terms of the optimal policy $\pi_{r}$ by rearranging Equation 1:

$$\pi_{r}(y|x)\cdot Z(x)=\exp\Big(\frac{1}{\beta}r(x,y)\Big)$$

Taking the logarithm of both sides and multiplying by $\beta$,

$$r(x,y)=\beta\log\pi_{r}(y|x)+\beta\log Z(x) \tag{2}$$
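The closed form above can be sanity-checked numerically. The following is a minimal sketch, assuming a toy setting with six candidate responses for a single prompt, randomly drawn rewards, and an illustrative value of $\beta$ (none of these come from the paper). It builds $\pi_r$ from Equation 1, recovers the reward through Equation 2, and compares the objective value at $\pi_r$ against randomly sampled distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5                 # illustrative value, any beta > 0 works
r = rng.normal(size=6)     # toy rewards r(x, y) for six responses to one prompt

def objective(pi, r, beta):
    """E_{y ~ pi}[r(x, y) - beta * log pi(y|x)] for a strictly positive pi."""
    return float(np.sum(pi * (r - beta * np.log(pi))))

# Equation 1: pi_r(y|x) = exp(r / beta) / Z(x), computed in log space
logits = r / beta
shift = logits.max()       # subtract the max for numerical stability
log_Z = shift + np.log(np.sum(np.exp(logits - shift)))
pi_star = np.exp(logits - log_Z)

# Equation 2: r(x, y) = beta * log pi_r(y|x) + beta * log Z(x)
r_recovered = beta * np.log(pi_star) + beta * log_Z

# Random points on the simplex should not score higher than pi_star
best_random = max(objective(rng.dirichlet(np.ones(6)), r, beta) for _ in range(1000))

print("sums to one:       ", np.isclose(pi_star.sum(), 1.0))
print("reward recovered:  ", np.allclose(r_recovered, r))
print("pi_star is optimal:", best_random <= objective(pi_star, r, beta))
```

Substituting Equation 2 into the objective shows that its value at $\pi_r$ is exactly $\beta\log Z(x)$, which is why the partition function reappears in the reparameterization used later in this appendix.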

A.2 Deriving the Gradient of the TPO Objective

In this section, we derive the gradient of the TPO objective:

$$\nabla_{\theta}\mathcal{L}_{\text{TPO}}=-\nabla_{\theta}\,\mathbb{E}_{(x,y_{ref},y_{w},y_{l})\sim\mathcal{D}}\Big[\alpha\log\pi_{\theta}(y_{ref}|x)+\log\sigma\big(\beta\log\pi_{\theta}(y_{w}|x)-\beta\log\pi_{\theta}(y_{l}|x)\big)\Big] \tag{1}$$

We can rewrite the right-hand side of Equation 1 as

$$\nabla_{\theta}\mathcal{L}_{\text{TPO}}=-\mathbb{E}_{(x,y_{ref},y_{w},y_{l})\sim\mathcal{D}}\Big[\underbrace{\alpha\nabla_{\theta}\log\pi_{\theta}(y_{ref}|x)}_{\text{(a)}}+\underbrace{\nabla_{\theta}\log\sigma\big(\beta\log\pi_{\theta}(y_{w}|x)-\beta\log\pi_{\theta}(y_{l}|x)\big)}_{\text{(b)}}\Big] \tag{2}$$

In Equation 2, part (b) can be rewritten using the substitution

$$u=\beta\log\pi_{\theta}(y_{w}|x)-\beta\log\pi_{\theta}(y_{l}|x)$$

and the chain rule:

$$\begin{aligned}
\nabla_{\theta}\log\sigma(u) &= \frac{1}{\sigma(u)}\nabla_{\theta}\sigma(u)\\
\nabla_{\theta}\log\sigma(u) &= \frac{\sigma'(u)}{\sigma(u)}\nabla_{\theta}u
\end{aligned}$$

Using the properties of the sigmoid function, $\sigma'(u)=\sigma(u)(1-\sigma(u))$ and $\sigma(-u)=1-\sigma(u)$,

$$\begin{aligned}
\nabla_{\theta}\log\sigma(u) &= \frac{\sigma(u)(1-\sigma(u))}{\sigma(u)}\nabla_{\theta}u\\
\nabla_{\theta}\log\sigma(u) &= (1-\sigma(u))\nabla_{\theta}u\\
\nabla_{\theta}\log\sigma(u) &= \sigma(-u)\nabla_{\theta}u
\end{aligned}$$

$$\nabla_{\theta}\log\sigma(u)=\beta\,\sigma\big(\beta\log\pi_{\theta}(y_{l}|x)-\beta\log\pi_{\theta}(y_{w}|x)\big)\big[\nabla_{\theta}\log\pi_{\theta}(y_{w}|x)-\nabla_{\theta}\log\pi_{\theta}(y_{l}|x)\big] \tag{3}$$

Plugging Equation 3 into Equation 2, we get

$$\begin{aligned}
\nabla_{\theta}\mathcal{L}_{\text{TPO}}= &-\mathbb{E}_{(x,y_{ref},y_{w},y_{l})\sim\mathcal{D}}\Big[\alpha\nabla_{\theta}\log\pi_{\theta}(y_{ref}|x)\\
&+\beta\,\sigma\big(\beta\log\pi_{\theta}(y_{l}|x)-\beta\log\pi_{\theta}(y_{w}|x)\big)\\
&\times\big[\nabla_{\theta}\log\pi_{\theta}(y_{w}|x)-\nabla_{\theta}\log\pi_{\theta}(y_{l}|x)\big]\Big]
\end{aligned} \tag{4}$$
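The gradient expression can be checked against finite differences. The sketch below is not the authors' implementation: it assumes a toy policy that is a softmax over five discrete responses with logits `theta`, a single triple $(y_{ref}, y_w, y_l)$ given by hypothetical indices, and illustrative values for $\alpha$ and $\beta$. Under that assumption, $\nabla_\theta\log\pi_\theta(y|x)$ is the one-hot vector for $y$ minus $\pi_\theta$.

```python
import numpy as np

def log_softmax(theta):
    z = theta - theta.max()
    return z - np.log(np.sum(np.exp(z)))

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def tpo_loss(theta, ref, w, l, alpha, beta):
    """Per-example TPO loss: -(alpha * log pi(y_ref) + log sigma(u))."""
    logp = log_softmax(theta)
    u = beta * (logp[w] - logp[l])
    return -(alpha * logp[ref] + np.log(sigmoid(u)))

def tpo_grad(theta, ref, w, l, alpha, beta):
    """Analytic gradient following the final expression of this section."""
    logp = log_softmax(theta)
    pi = np.exp(logp)

    def grad_logp(y):
        # gradient of log softmax(theta)[y] with respect to theta
        g = -pi.copy()
        g[y] += 1.0
        return g

    u = beta * (logp[w] - logp[l])
    return -(alpha * grad_logp(ref)
             + beta * sigmoid(-u) * (grad_logp(w) - grad_logp(l)))

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta = rng.normal(size=5)     # toy logits over five candidate responses
ref, w, l = 0, 1, 2            # hypothetical indices of y_ref, y_w, y_l
alpha, beta = 1.0, 0.1         # illustrative values, not the paper's settings

analytic = tpo_grad(theta, ref, w, l, alpha, beta)
numeric = numerical_grad(lambda t: tpo_loss(t, ref, w, l, alpha, beta), theta)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The weight $\sigma(-u)$ is large when the policy ranks $y_l$ above $y_w$, so the preference term pushes hardest on examples the policy currently orders incorrectly, while the $\alpha$ term always increases the likelihood of $y_{ref}$.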

A.3 Proof of Lemma

In this section, we will prove the lemmas from Section 3.2.

Lemma 1 Restated.

Under the Plackett-Luce preference framework, and in particular the Bradley-Terry framework, two reward functions from the same equivalence class induce the same preference distribution.

Proof. Consider two reward functions, $r(x,y)$ and $r'(x,y)$. They are said to be equivalent if they are related by $r'(x,y)=r(x,y)+g(x)$ for some function $g$. We analyze this in the context of the general Plackett-Luce model, which includes the Bradley-Terry model as the special case $K=2$. We denote the probability distribution over rankings induced by a given reward function $r(x,y)$ as $p_{r}$. For any prompt $x$, responses $y_{1},\ldots,y_{K}$, and ranking $\tau$, we have:

$$\begin{aligned}
p_{r'}(\tau\mid y_{1},\ldots,y_{K},x) &= \prod_{k=1}^{K}\frac{\exp(r'(x,y_{\tau(k)}))}{\sum_{j=k}^{K}\exp(r'(x,y_{\tau(j)}))}\\
&= \prod_{k=1}^{K}\frac{\exp(r(x,y_{\tau(k)})+g(x))}{\sum_{j=k}^{K}\exp(r(x,y_{\tau(j)})+g(x))}\\
&= \prod_{k=1}^{K}\frac{\exp(g(x))\exp(r(x,y_{\tau(k)}))}{\exp(g(x))\sum_{j=k}^{K}\exp(r(x,y_{\tau(j)}))}\\
&= \prod_{k=1}^{K}\frac{\exp(r(x,y_{\tau(k)}))}{\sum_{j=k}^{K}\exp(r(x,y_{\tau(j)}))}\\
&= p_{r}(\tau\mid y_{1},\ldots,y_{K},x).
\end{aligned}$$

This completes the proof.
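As an illustration of Lemma 1, the following sketch evaluates the Plackett-Luce probability of a ranking before and after adding a prompt-dependent constant to every reward. The reward values, the shift `g_x`, and the ranking are arbitrary choices made for this example; rankings are given as tuples of response indices, best first.

```python
import itertools
import numpy as np

def plackett_luce_prob(rewards, ranking):
    """Probability of `ranking` (response indices, best first) under Plackett-Luce."""
    ordered = np.asarray(rewards, dtype=float)[list(ranking)]
    prob = 1.0
    for k in range(len(ordered)):
        # choose the k-th item among the items that have not been ranked yet
        prob *= np.exp(ordered[k]) / np.sum(np.exp(ordered[k:]))
    return prob

r = np.array([0.3, -1.2, 2.0, 0.7])   # toy rewards r(x, y_1..y_4) for one prompt
g_x = 5.0                             # arbitrary offset g(x), the same for every response
ranking = (2, 3, 0, 1)

p_r = plackett_luce_prob(r, ranking)
p_r_shifted = plackett_luce_prob(r + g_x, ranking)
print("same ranking probability:", np.isclose(p_r, p_r_shifted))

# With K = 2 the model reduces to Bradley-Terry: p(y_1 > y_2) = sigma(r_1 - r_2)
a, b = 1.5, 0.4
bt = plackett_luce_prob([a, b], (0, 1))
print("Bradley-Terry special case:", np.isclose(bt, 1.0 / (1.0 + np.exp(-(a - b)))))
```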

Lemma 2 Restated.

Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.

Proof. Consider two reward functions, $r(x,y)$ and $r'(x,y)$. They are said to be equivalent if they are related by $r'(x,y)=r(x,y)+g(x)$ for some function $g$. Let $\pi_{r}$ and $\pi_{r'}$ be the optimal policies induced by their corresponding reward functions. By Equation 5, for all $x,y$ we have:

$$\begin{aligned}
\pi_{r'}(y\mid x) &= \frac{1}{\sum_{y}\exp\left(\frac{1}{\beta}r'(x,y)\right)}\exp\left(\frac{1}{\beta}r'(x,y)\right)\\
&= \frac{1}{\sum_{y}\exp\left(\frac{1}{\beta}(r(x,y)+g(x))\right)}\exp\left(\frac{1}{\beta}\big(r(x,y)+g(x)\big)\right)\\
&= \frac{1}{\exp\left(\frac{1}{\beta}g(x)\right)\sum_{y}\exp\left(\frac{1}{\beta}r(x,y)\right)}\exp\left(\frac{1}{\beta}r(x,y)\right)\exp\left(\frac{1}{\beta}g(x)\right)\\
&= \frac{1}{\sum_{y}\exp\left(\frac{1}{\beta}r(x,y)\right)}\exp\left(\frac{1}{\beta}r(x,y)\right)\\
&= \pi_{r}(y\mid x).
\end{aligned}$$

This completes the proof.
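Lemma 2 can be illustrated in the same spirit. This sketch assumes a small reward table with three prompts and four candidate responses, plus an illustrative $\beta$; all values are made up for the example. It shows that an offset depending only on the prompt leaves the optimal policy unchanged, whereas an offset that also varies with the response generally does not.

```python
import numpy as np

def optimal_policy(r, beta):
    """Row-wise softmax of r / beta: one distribution over responses per prompt."""
    logits = r / beta
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
beta = 0.7                        # illustrative value
r = rng.normal(size=(3, 4))       # toy rewards: 3 prompts x 4 responses
g = rng.normal(size=(3, 1))       # g(x): one offset per prompt, shared by its responses
h = rng.normal(size=(3, 4))       # offset that also depends on y (outside the class)

pi_r = optimal_policy(r, beta)
same_class = np.allclose(pi_r, optimal_policy(r + g, beta))
other_class = np.allclose(pi_r, optimal_policy(r + h, beta))

print("r + g(x) gives the same policy:   ", same_class)
print("r + h(x, y) gives the same policy:", other_class)
```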

A.4 Proof of Theorem

Theorem 1 Restated.

For a parameter $\beta>0$, all reward equivalence classes can be reparameterized as $r(x,y)=\beta\log\pi(y|x)$ for some model $\pi(y|x)$.

Proof. Consider a reward function $r(x,y)$ that induces an optimal model $\pi_{r}(y|x)$ under the MERL framework, which takes the form shown in Equation 5 in Section 3.1. Following Equation 2 in Section A.1 of the Appendix, we have:

$$r(x,y)=\beta\log\pi_{r}(y|x)+\beta\log Z(x) \tag{1}$$

where $Z(x)=\sum_{y}\exp\big(\frac{1}{\beta}r(x,y)\big)$ is the partition function of the optimal policy induced by the reward function $r(x,y)$. Let $r'(x,y)$ be a new reward function such that $r'(x,y)=r(x,y)-\beta\log Z(x)$. Since $\beta\log Z(x)$ depends only on $x$, the new reward function is within the equivalence class of $r$, and we have:

$$r'(x,y)=r(x,y)-\beta\log Z(x)$$

From Equation 1, we get

$$\begin{aligned}
r'(x,y) &= \beta\log\pi_{r}(y|x)+\beta\log Z(x)-\beta\log Z(x)\\
r'(x,y) &= \beta\log\pi_{r}(y|x)
\end{aligned}$$

This completes the proof.

Proposition 1.

For a parameter $\beta>0$, every equivalence class of reward functions has a unique reward function $r(x,y)$ that can be reparameterized as $r(x,y)=\beta\log\pi(y|x)$ for some model $\pi(y|x)$.

Proof by contradiction. Assume that we have two reward functions from the same class, such that $r'(x,y)=r(x,y)+g(x)$. Assume further that $r'(x,y)=\beta\log\pi'(y|x)$ for some model $\pi'(y|x)$ and $r(x,y)=\beta\log\pi(y|x)$ for some model $\pi(y|x)$, such that $\pi'\neq\pi$. We then have

$$\begin{aligned}
r'(x,y) &= r(x,y)+g(x)\\
&= \beta\log\pi(y|x)+g(x)\\
&= \beta\log\pi(y|x)+\beta\log\exp\Big(\frac{1}{\beta}g(x)\Big)\\
&= \beta\log\Big[\pi(y|x)\exp\Big(\frac{1}{\beta}g(x)\Big)\Big]\\
&= \beta\log\pi'(y|x)
\end{aligned}$$

for all prompts $x$ and completions $y$. Then, we must have $\pi(y|x) \exp\left(\frac{1}{\beta} g(x)\right) = \pi'(y|x)$. Since these are probability distributions, summing over $y$ on both sides,

$$
\begin{aligned}
\sum_{y} \left[ \pi(y|x) \exp\left(\frac{1}{\beta} g(x)\right) \right] &= \sum_{y} \pi'(y|x) \\
\exp\left(\frac{1}{\beta} g(x)\right) &= 1
\end{aligned}
$$

Since $\beta > 0$, $g(x)$ must be 0 for all $x$. Therefore, we will have $r(x,y) = r'(x,y)$, and hence $\pi = \pi'$, which contradicts our initial condition of $\pi' \neq \pi$.

Thus, by contradiction, we have shown that every reward class has a unique reward function that can be represented by the reparameterization in Theorem 3.1.
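The normalization step of the proof can be checked numerically. The sketch below is only an illustration, not part of the original derivation: the toy distribution `pi`, the helper name `shifted_policy`, and the tested values of `g` are all assumptions chosen for the example. It shows that $\pi(y|x)\exp(g(x)/\beta)$ sums to 1 over $y$ only when $g(x) = 0$.

```python
import math

def shifted_policy(pi, g, beta):
    """Return pi(y|x) * exp(g / beta) for every completion y of one prompt x.

    The result is a valid probability distribution only if it sums to 1.
    """
    scale = math.exp(g / beta)
    return [p * scale for p in pi]

# Hypothetical toy distribution over three completions for a single prompt.
pi = [0.5, 0.3, 0.2]
beta = 0.1

for g in (0.0, 0.05, -0.05):
    total = sum(shifted_policy(pi, g, beta))
    print(f"g(x) = {g:+.2f} -> sum over y = {total:.4f}")
```

Any nonzero shift rescales every probability by the same factor, so the total moves away from 1 and the shifted function can no longer be written as $\beta \log \pi'(y|x)$ for a normalized $\pi'$.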

Appendix B Training and Evaluation Details

All models were trained using the AdamW optimizer without weight decay. Furthermore, parameter-efficient techniques such as LoRA Hu et al. (2021) were not employed. The experiments were conducted on 4 A100 GPUs, utilizing bfloat16 precision, and typically required 5-8 hours to complete. All models were trained for one epoch, employing a linear learning rate scheduler with a peak learning rate of 5e-07 and 10% warmup steps. Additionally, the global batch size was set to 16, and $\beta = 0.1$ was used to regulate the deviation from the reference model. For every dataset used in our evaluation, Table 4 lists the number of few-shot examples and the metric used for assessment.
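The learning rate schedule described above can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name `linear_warmup_decay_lr` is hypothetical, the decay after warmup is assumed to go linearly to zero (the text only says "linear"), and the 10,000-example run length is an assumed figure used to get a concrete step count. Only the peak rate of 5e-07, the 10% warmup, and the global batch size of 16 come from the text.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup from 0 to peak_lr over the first warmup_ratio of training,
    then (assumption) linear decay from peak_lr to 0 at total_steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Hypothetical run length: 10,000 examples, global batch size 16, one epoch.
total_steps = 10_000 // 16  # 625 optimizer steps
for step in (0, 31, 62, 300, 624):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

With these numbers the warmup covers the first 62 steps, the peak is reached at step 62, and the rate then falls linearly for the rest of the epoch.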

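| Datasets | ARC | TruthfulQA | Winogrande | HellaSwag | MMLU | BB-causal | BB-sports | BB-formal | OpenBookQA |
|---|---|---|---|---|---|---|---|---|---|
| # few-shot | 25 | 0 | 5 | 10 | 5 | 3 | 3 | 3 | 1 |
| Metric | acc_norm | mc2 | acc | acc_norm | acc | mc | mc | mc | acc_norm |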
Table 4: Detailed information of Open LLM Leaderboard and Big Bench benchmarks.

The custom UltraFeedback dataset includes $y_{ref}$, $y_w$, and $y_l$ for each input $x$. For a fair comparison, when training alignment methods based on the SFT model, we utilized $y_w$ and $y_l$ under the assumption that the model was trained on $y_{ref}$ during supervised fine-tuning. Conversely, in scenarios where we directly trained a model using alignment methods, we used $y_{ref}$ and $y_l$.
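The pair-selection rule above, as it applies to the pairwise baselines, can be written down directly. The sketch below is illustrative only: the `Example` container, the `preference_pair` helper, and the `starts_from_sft` flag are hypothetical names, not identifiers from the released code.

```python
from typing import NamedTuple, Tuple

class Example(NamedTuple):
    """One record of the custom dataset: prompt plus three responses."""
    x: str
    y_ref: str
    y_w: str
    y_l: str

def preference_pair(ex: Example, starts_from_sft: bool) -> Tuple[str, str, str]:
    """Return (prompt, preferred, dispreferred) for a pairwise alignment method.

    If the policy starts from an SFT model (assumed to have seen y_ref),
    the pair is (y_w, y_l); if alignment is applied directly to the base
    model, the pair is (y_ref, y_l).
    """
    if starts_from_sft:
        return ex.x, ex.y_w, ex.y_l
    return ex.x, ex.y_ref, ex.y_l

ex = Example(x="prompt", y_ref="gold", y_w="chosen", y_l="rejected")
print(preference_pair(ex, starts_from_sft=True))
print(preference_pair(ex, starts_from_sft=False))
```

In both scenarios $y_l$ is the dispreferred response; only the preferred side changes.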

Appendix C More Experiments

In this section, we assess the performance of alignment methods in two distinct scenarios: 1) skipping the SFT component and 2) aligning an SFT model that has been fine-tuned on a dataset of 10K instances using various alignment techniques.

C.1 Skipping the SFT Component

The primary benefit of using TPO is the ability to skip the SFT component, which often results in better performance for TPO without SFT. In this experiment, we also investigate the effectiveness of other alignment methods without the SFT part. For this purpose, we directly trained a Mistral-7B-v0.1 model using various alignment techniques like DPO, KTO, IPO, CPO, and ORPO.

| Model | Align | MT-Bench |
|---|---|---|
| Mistral | SFT | 5.94 |
| Mistral | DPO | 5.45 |
| Mistral | KTO | 6.21 |
| Mistral | IPO | 2.06 |
| Mistral | CPO | 6.3 |
| Mistral | ORPO | 5.47 |
| Mistral | TPO (ours, $\alpha=0.9$, $\beta=0.2$) | 6.22 |
| Mistral | TPO (ours, $\alpha=0.3$, $\beta=0.7$) | 6.61 |
| Mistral | TPO (ours, $\alpha=1$, $\beta=0.1$) | 6.66 |

Table 5: Comparison of the performance of various alignment methods when skipping the SFT part, using MT-Bench.

The results in Table 5 indicate that, without the SFT component, both DPO (5.45) and IPO (2.06) fail to match the performance of Mistral+SFT (5.94). The results for KTO (6.21) and CPO (6.3) differ only slightly from SFT. Although ORPO recommends bypassing the SFT phase in the alignment process, a policy model fine-tuned with ORPO for only one epoch (5.47) also underperforms SFT. A comparison between the results in Tables 2 and 5 reveals that most of the alignment methods perform better when the SFT part is retained.

C.2 Aligning an SFT Model with Less Data

In this experiment, we investigate how alignment methods perform when applied to an SFT model trained on significantly less data. TPO utilizes the dataset $D = \{x^i, y^i_{ref}, y^i_w, y^i_l\}_{i=1}^{N}$. We first fine-tuned a Mistral-7B-v0.1 model on 10K examples, which are the responses designated as $y_{ref}$ for TPO. We then applied various alignment methods to this fine-tuned model.

| Model (Training Size) | DPO | CPO | KTO | IPO |
|---|---|---|---|---|
| + Mistral+SFT (200K) | 6.64 | 6.2 | 6.48 | 6.43 |
| + Mistral+SFT (10K) | 5.33 | 5.89 | 5.3 | 6.41 |
Table 6: Comparison of the performance of various alignment methods on different SFT models using the MT-Bench. Notably, the score for Mistral+SFT trained on 10K data is 4.2, while the score for Mistral+SFT trained on 200K data is 5.94.

The findings presented in Table 6 suggest that alignment methods yield superior results when applied to an SFT model trained on a larger dataset. It is evident that, when using the same data as for Mistral+TPO, other models perform significantly worse. These results confirm our hypothesis that TPO surpasses other methods with considerably less data.

Appendix D More results on Open LLM Leaderboard and Big Bench Benchmarks

Our assessment of Phi-2 through the Open LLM Leaderboard benchmarks, in comparison with various alignment methods, showed that Phi-2+TPO, trained on a dataset of 10K, achieved performance on par with other alignment strategies across the ARC, TruthfulQA, and MMLU benchmarks. The results also showed that this model performs better on BB-causal and OpenBookQA.

| Model | Align | ARC | TruthfulQA | Winogrande | HellaSwag | MMLU | BB-causal | BB-sports | BB-formal | OpenBookQA |
|---|---|---|---|---|---|---|---|---|---|---|
| Phi-2 | SFT | 61 | 46.01 | 74.58 | 74.66 | 56.48 | 55.26 | 51.72 | 49.54 | 50.2 |
| Phi-2+SFT | DPO | 61.34 | 51.53 | 74.82 | 75.88 | 56.99 | 57.36 | 52.63 | 49.5 | 52.2 |
| Phi-2+SFT | IPO | 61.43 | 49.05 | 75.05 | 75.36 | 56.83 | 55.26 | 51.31 | 49.69 | 51.2 |
| Phi-2+SFT | KTO | 61 | 52.35 | 74.98 | 75.43 | 57.02 | 56.31 | 51.62 | 49.47 | 51.4 |
| Phi-2+SFT | CPO | 60.49 | 53.3 | 75.05 | 74.78 | 56.94 | 54.21 | 50.5 | 49.48 | 49.8 |
| Phi-2 | ORPO | 61.17 | 45.68 | 74.42 | 74.69 | 58.33 | 55.78 | 50.7 | 49.01 | 52.8 |
| Phi-2+SFT | TPO (ours) | 61.09 | 53.6 | 74.82 | 74.98 | 56.95 | 54.21 | 50.3 | 49.27 | 50.6 |
| Phi-2 | TPO (ours, $\alpha=1$, $\beta=0.1$) | 61.51 | 45.41 | 74.34 | 75.27 | 58.38 | 55.78 | 51.44 | 49.28 | 53.2 |
| Phi-2 | TPO (ours, $\alpha=0.9$, $\beta=0.2$) | 61.6 | 46.21 | 74.66 | 74.91 | 58.12 | 57.36 | 51.31 | 48.35 | 53.4 |
Table 7: Comparison between TPO and other alignment methods on Open LLM Leaderboard and Big Bench benchmarks based on Phi-2 model.

Appendix E More results on Ablation Studies

This section presents the performance of Mistral+TPO across various learning rates, numbers of epochs, and batch sizes, using the MT-Bench score as the benchmark for assessment.

| Model | Align | Learning Rate | Epoch | Batch Size | First Turn (Score) | Second Turn (Score) | Average (Score) |
|---|---|---|---|---|---|---|---|
| Mistral | TPO ($\alpha=1$, $\beta=0.1$) | 5e-07 | 1 | 16 | 6.78 | 5.66 | 6.22 |
| Mistral | TPO ($\alpha=1$, $\beta=0.1$) | 2e-05 | 1 | 16 | 1 | 1 | 1 |
| Mistral | TPO ($\alpha=0.9$, $\beta=0.2$) | 5e-07 | 1 | 16 | 7.12 | 6.2 | 6.66 |
| Mistral | TPO ($\alpha=0.9$, $\beta=0.2$) | 5e-07 | 1 | 32 | 6.98 | 6.1 | 6.54 |
| Mistral | TPO ($\alpha=0.9$, $\beta=0.2$) | 5e-07 | 2 | 16 | 7.2 | 6 | 6.61 |
Table 8: Performance of the Mistral+TPO on different values of hyper-parameters.
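As a reading aid for Table 8, the Average column can be recomputed from the two turn scores. The sketch below assumes the average is the unweighted mean of the first-turn and second-turn scores, which the table does not state explicitly; the helper name `turn_mean` is hypothetical, and the numbers are copied from the table as printed (already rounded).

```python
# (first turn, second turn, reported average), copied from Table 8.
rows = [
    (6.78, 5.66, 6.22),
    (1.0, 1.0, 1.0),
    (7.12, 6.2, 6.66),
    (6.98, 6.1, 6.54),
    (7.2, 6.0, 6.61),
]

def turn_mean(first, second):
    """Unweighted mean of the two turn scores (an assumption about the metric)."""
    return (first + second) / 2

# Absolute gap between the recomputed mean and the reported average.
gaps = [abs(turn_mean(f, s) - avg) for f, s, avg in rows]
for (f, s, avg), gap in zip(rows, gaps):
    print(f"recomputed {turn_mean(f, s):.2f} vs reported {avg:.2f} (gap {gap:.2f})")
```

Under this assumption the recomputed values agree with the reported averages to within the rounding of the printed turn scores.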