Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback

Omar Shaikh  Michelle Lam  Joey Hejna  Yijia Shao & Michael Bernstein  Diyi Yang

Stanford University
Abstract

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number ($<10$) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user’s demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users’ demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO’s ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants ($N=16$). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs. Code: https://github.com/SALT-NLP/demonstrated-feedback

* Equal contribution.

1 Introduction

Large language models (LLMs) are trained for general-purpose use. In practice, however, they are often applied to very specific tasks for very specific users. Consider a task as simple as writing an email: our preferred email depends on personal writing style, the specific email task, or the target audience (a friend, stranger, etc.). As a result, there can be a mismatch between the universal style [37, 9] trained into an LLM via instruction and preference tuning, and the specific style needed for applications. LLM outputs feel unopinionated and generic because of this mismatch.

While existing approaches such as supervised or preference finetuning are effective, they can require a large corpus of (un)acceptable behavior (on the order of $\approx 1K$ samples [54, 32]), which in turn requires unreasonably high effort from an individual. RLAIF methods like Constitutional AI [4] automate pairwise preference collection with an LLM, but align models to general principles that may not capture fine-grained preferences. Although prompting is data efficient, finding an effective prompt can be tedious: end-users often rely on brittle prompting heuristics [55, 49]. How might we efficiently communicate preferences and align a language model to a new individual or task?

Figure 1: DITTO iteratively aligns LLMs to demonstrated behavior. When a user supplies demonstrations (through edits to a model’s output, past preferred interaction history, or writing examples from scratch), DITTO treats these demonstrations as preferred to all model behavior, including earlier iterations of the trained model. Using demonstrations as feedback allows for cheap generation of online comparison data and enables few-shot alignment with just a handful of samples.

This paper introduces a framework for aligning LLMs to specific settings by providing a small number of demonstrations (Fig. 1). Rather than using prompts, principles, or pairwise preferences, we show that we can achieve strong alignment with individuals by leveraging a small number of user-provided examples of desired behavior. These examples can be drawn from a user’s existing interaction logs, or from direct edits made to LLM outputs. Our approach, DITTO, scaffolds a handful of these demonstrations ($<10$) into a substantial dataset of preference comparisons, by treating users’ demonstrations as preferred over model output from both the original LLM and models’ earlier training iterations. This augmented dataset of demonstration-grounded comparisons can then be used to update the language model using an alignment algorithm like DPO [34]. We additionally show that DITTO can be interpreted as an online imitation learning algorithm, where data sampled from the LLM is used to distinguish expert behavior. This perspective allows us to prove that DITTO can extrapolate beyond the performance of the expert (§3).

Since DITTO focuses on user/task-specific alignment, we benchmark DITTO through (1) an evaluation on datasets of author-specific writing (§4.1) and (2) a user evaluation (§4.2) on real-world tasks defined by human participants. Our author-specific datasets include writing from blog posts to emails to articles. We find that win rates for DITTO outperform methods like SFT (avg. 11% pt. increase), self-play methods like SPIN (20.2% pt.), and few-shot prompting (33.4% pt.) on Mistral 7B, even when few-shot prompts are provided to a more powerful LLM (GPT-4, 18% pt.). Next, we conduct a user study ($N=16$), asking individuals to edit generations from GPT-4 in an email-writing task. We use finalized demonstrations as inputs for DITTO. In these realistic user evaluations, DITTO’s advantage becomes clearer: DITTO continues to outperform baselines, including few-shot prompting (23.9% pt.), user-constructed prompts (27.9% pt.), and SFT (12% pt.). Finally, in a direct comparison between demonstrations and pairwise feedback, we show that using demonstrations with DITTO is an order of magnitude more sample-efficient for individuals than soliciting pairwise preferences.

2 Related Work

LLMs and Preference Finetuning. Large language models trained on vast amounts of data have been known to perform well with careful prompting [7, 44]. Prompts, however, can be incredibly tedious to design [49] and are often sensitive to variations. Thus, it has become necessary to finetune these models on large curated instruction-following datasets [28, 42, 12] and/or employ RLHF, where the LLM is trained to maximize a reward function learned from human preferences as a contextual bandit [58]. Typically, this is done using policy-gradient style methods [45, 38], though more recent works learn directly from preference data [34, 19, 3]. While these methods are effective at tasks like summarization [41, 46, 47] and instruction following [32, 30], they require thousands to hundreds of thousands of paired comparisons to obtain a quality estimate of reward. This makes them prohibitively expensive for a wide range of applications, such as training a customized writing assistant or building a domain-specific chatbot. Group Preference Optimization (GPO) [52] takes a promising step towards few-shot alignment of LLMs; however, preference groups must be pre-defined for meta-learning, which requires a large dataset. On the other hand, Gao et al. [16] use direct edits to distill latent preferences into prompt-based principles. In place of principles or pairwise feedback, DITTO directly learns preferences from a set of demonstrations, similar to model editing from canonical examples [20]. Drawing from prior studies on programming by demonstration and end-user programming in HCI [13, 14], our work aims at soliciting feedback at a finer-grained level than binary preferences, principles, or prompts.

Self-Improvement. Recent works use iterative sampling to improve LLMs. Approaches like STaR [50, 51, 2] are supervised by verifying the correctness of outputs, while Yuan et al. [48] and Burns et al. [8] use (potentially stronger) language models as critics. Unlike these approaches, DITTO does not require external signals besides demonstrations, similar to self-play methods like SPIN [10]. Unlike SPIN, which uses thousands of demonstrations and is targeted more towards SFT-scale datasets, DITTO is designed for fast adaptation in the data-limited setting and thus has a few key distinctions. Namely, DITTO does not update the reference policy and uses intermodel comparisons to combat overfitting. We found these changes to be important to obtain good performance with only a handful of demonstrations. In data-abundant settings, other works have shown that an oracle reward function [18] or model [26] is sufficient to provide feedback. We consider tasks like personalization, for which there is no abundant data or oracle.

Online Imitation Learning. DITTO builds on online imitation learning, which appeals to the long-standing success of learning reward functions from comparisons [15, 1]. Brown et al. [5] first showed that with ranked demonstrations, one could improve a policy beyond the demonstrator’s performance. Follow-ups used automatic noise injection to remove human rankings [6]. Other contemporary approaches to online imitation learning are based on adversarial games between reward and policy players [57, 21]. In our case, we use a KL-constrained formulation, like Watson et al. [43]. Sikchi et al. [39] generalize the adversarial game to a ranking game and thus use generated comparisons like DITTO. Unlike DITTO, however, these approaches explicitly require learning a reward function and are designed for continuous control, not for LLMs.

3 DITTO

While prior works use thousands of comparisons to align LLMs, DITTO instead uses only a handful of expert demonstrations to alter a model’s behavior. This type of cheap, rapid adaptation is enabled by our core insight: online comparison data can be easily obtained from demonstrations.

3.1 Notation and Background

A language model can be viewed as a policy $\pi(y|x)$ that produces a distribution over completions $y$ to a prompt $x$. In RLHF, our objective is to train an LLM to maximize a reward function $r(x,y)$ that measures the quality of a prompt-completion pair $(x,y)$. Typically, a KL-divergence constraint is added to prevent the updated model from straying too far from a base LM [58], which we denote as $\pi_{\textrm{ref}}$. Altogether, RLHF methods optimize the following objective,

$$\mathcal{J}_{\text{KL}}(\pi)=\mathbb{E}_{y\sim\pi(\cdot|x),\,x\sim p}\left[r(x,y)-\alpha\log\frac{\pi(y|x)}{\pi_{\textrm{ref}}(y|x)}\right] \tag{1}$$

which maximizes the expected reward over the prompt distribution $p$ subject to a KL-constraint modulated by $\alpha$. Usually, this objective is optimized using a comparison dataset of the form $\{(x,y^{w},y^{l})\}$, where the “win” completion $y^{w}$ is preferred to the “loss” completion $y^{l}$, which we write as $y^{w}\succeq y^{l}$.

While this objective is ubiquitous in prior work [32, 34], it is typically applied in the context of population-based reward functions learned from large comparison datasets collected via a multitude of annotators. In contrast, we consider $r(x,y)$ to be the objective of a single individual. In this regime, collecting thousands of comparisons from one user is infeasible. Instead, we assume access to a small dataset of expert demonstrations, denoted $\mathcal{D}_E$. We assume these demonstrations to be generated from the expert policy $\pi_E=\operatorname*{arg\,max}_{\pi}\mathbb{E}_{y\sim\pi(\cdot|x),\,x\sim p}[r(x,y)]$, which maximizes reward in expectation. While demonstrations are typically used for SFT, such approaches typically struggle in data-limited settings. On the other hand, it can be difficult to prompt a model to “overcome” the priors induced by its RLHF training. DITTO, as described in the next section, addresses these problems by directly generating comparison data using LM outputs and expert demonstrations. This means that unlike synthetic data generation paradigms [26], DITTO does not require a model that performs well at the given task a priori.

Algorithm 1: DITTO
Input: LM $\pi_{\textrm{ref}}$, demos $\mathcal{D}_E=\{(x_i,y^E_i)\}_{i\in N}$, sample size $M$, sample frequency $K$
Init: $\pi_0\leftarrow\textrm{SFT}(\pi_{\textrm{ref}},\mathcal{D}_E)$, $t=0$
while not converged do
      $\mathcal{D}_t\leftarrow\cup_{i=1}^{N}\{(x_i,\,y_j\sim\pi_t(\cdot|x_i))\}_{j=1}^{M}$
      for $k=1,2,3,\ldots,K$ do
            Sample batch $B=\{(x,y^w,y^l)\}$ of comparisons from induced ranking:
                  $\mathcal{D}_E\succeq\mathcal{D}_t\succeq\mathcal{D}_{t-1}\succeq\ldots\succeq\mathcal{D}_0$
            $\pi_t\leftarrow\text{DPO}(\pi_t,B)$  # Update policy
            $t\leftarrow t+1$
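The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `sft`, `sample`, `make_batch`, and `dpo_step` are hypothetical placeholders for real training and generation code, the convergence test is replaced by a fixed `num_rounds`, and the sketch follows one reading of the algorithm in which new samples are drawn once per outer iteration and followed by `K` policy updates.

```python
from typing import Any, Callable, List, Tuple

Demo = Tuple[str, str]               # (prompt x_i, expert completion y_i^E)
Comparison = Tuple[str, str, str]    # (x, preferred y^w, rejected y^l)
Round = List[Tuple[str, List[str]]]  # D_t: M sampled completions per prompt


def ditto(
    demos: List[Demo],
    sft: Callable[[List[Demo]], Any],
    sample: Callable[[Any, str], str],
    make_batch: Callable[[List[Demo], List[Round]], List[Comparison]],
    dpo_step: Callable[[Any, List[Comparison]], Any],
    num_rounds: int,
    M: int,
    K: int,
) -> Tuple[Any, List[Round]]:
    policy = sft(demos)                      # pi_0 <- SFT(pi_ref, D_E)
    history: List[Round] = []                # [D_0, D_1, ..., D_t]
    for _round in range(num_rounds):         # assumption: fixed budget, not a convergence test
        # D_t: M completions from the current policy for every demonstration prompt
        d_t = [(x, [sample(policy, x) for _ in range(M)]) for x, _y in demos]
        history.append(d_t)
        for _step in range(K):
            batch = make_batch(demos, history)   # comparisons from the induced ranking
            policy = dpo_step(policy, batch)     # one policy update
    return policy, history


# Toy stand-ins so the sketch runs end to end: the "policy" is just an update counter.
def toy_sft(demos):
    return 0

def toy_sample(policy, x):
    return f"policy{policy} draft for: {x}"

def toy_make_batch(demos, history):
    newest = dict(history[-1])
    return [(x, y_e, y) for x, y_e in demos for y in newest[x]]

def toy_dpo_step(policy, batch):
    return policy + 1

demos = [("Write a thank-you email.", "Hi Sam, thanks a ton!")]
final_policy, history = ditto(
    demos, toy_sft, toy_sample, toy_make_batch, toy_dpo_step,
    num_rounds=3, M=2, K=4,
)
```

Passing the four steps in as callables keeps the loop itself independent of any particular training library.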

3.2 DITTO

The key insight of DITTO is that the LM itself, along with the expert demonstrations, can generate comparison datasets for alignment, removing the need to collect a large number of pairwise preferences. This results in a contrastive-like objective, where the expert demonstrations are positives. Here we provide an intuitive explanation of DITTO; later we provide a more theoretical derivation in §3.3.

Generating Comparisons.

Consider a completion sampled from the expert policy, $y^E\sim\pi_E(\cdot|x)$. By virtue of being “expert”, $y^E$ is likely to have high reward, as $\pi_E$ is definitionally the reward maximizer in expectation. Consequently, we would expect samples from any other policy $\pi$ to have rewards less than or equal to those of $\pi_E$, i.e., $\forall\pi,\ \mathbb{E}_{\pi_E}[r(x,y)]\geq\mathbb{E}_{\pi}[r(x,y)]$. Using this observation, we can construct comparisons $(x,y^E,y^{\pi})$ where $y^E\succeq y^{\pi}$ by simply sampling completions $y^{\pi}\sim\pi(\cdot|x)$ for every demonstration-prompt pair in $\mathcal{D}_E$. Though such comparisons are derived from policies instead of individual examples, they have proven effective in prior work [6]. A naïve approach for DITTO would then optimize Eq. 1 using this dataset and an off-the-shelf RLHF algorithm. Doing so would increase the probability of the expert responses while decreasing the probability of the current model samples, unlike standard finetuning which only does the former. Crucially, using samples from $\pi$ allows us to construct an unbounded preference dataset given only a few demonstrations. However, we can do better by considering the temporal aspect of the learning process.
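As a minimal sketch of the comparison-generation step described above, the snippet below pairs each expert demonstration with fresh samples from a policy. The function name `build_expert_comparisons` and the stub `toy_policy` are assumptions made for illustration; in a real setting `sample_fn` would draw $y^{\pi}\sim\pi(\cdot|x)$ from a language model.

```python
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]              # (prompt x, expert completion y^E)
Comparison = Tuple[str, str, str]   # (prompt x, preferred y^w, rejected y^l)


def build_expert_comparisons(
    demos: List[Demo],
    sample_fn: Callable[[str], str],
    samples_per_demo: int,
) -> List[Comparison]:
    """Treat every expert demonstration as preferred over policy samples."""
    comparisons: List[Comparison] = []
    for prompt, expert in demos:
        for _ in range(samples_per_demo):
            comparisons.append((prompt, expert, sample_fn(prompt)))
    return comparisons


# Hypothetical stand-in for sampling from a language model policy.
def toy_policy(prompt: str) -> str:
    return f"generic reply #{random.randint(0, 999)} to: {prompt}"


demos = [("Write a thank-you email.", "Hi Sam, thanks a ton for the help!")]
pairs = build_expert_comparisons(demos, toy_policy, samples_per_demo=4)
```

Because `samples_per_demo` can be made arbitrarily large, a handful of demonstrations yields as many comparisons as the sampling budget allows.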

From Comparisons to Rankings.

Using comparisons only between the expert and a single policy $\pi$ may be insufficient for obtaining good performance. Doing so decreases likelihoods only at that specific $\pi$, leading to the overfitting problems that plague SFT in low-data regimes. Analogous to replay in RL [29], we can consider data generated from all policies learned over time during RLHF.

At the first iteration, let the initial policy be $\pi_0$. We can sample from this policy to assemble a dataset $\mathcal{D}_0=\{(x,y^{\pi_0})\}$. Then, we can generate comparison data for RLHF as $y^E\succeq y^{\pi_0}$, which we denote as $\mathcal{D}_E\succeq\mathcal{D}_0$ for brevity. Using these induced comparisons, we update $\pi_0$ to obtain a new policy $\pi_1$. By definition, $\mathbb{E}_{\pi_E}[r(x,y)]\geq\mathbb{E}_{\pi_1}[r(x,y)]$ as well. It follows that we can also generate comparisons using $\pi_1$ as $\mathcal{D}_E\succeq\mathcal{D}_1$. Continuing this procedure, we generate a progressively more diverse comparison dataset using all prior policies. We refer to these as “replay” comparisons.

While this approach is theoretically consistent, it decreases the likelihood of the LM everywhere except at expert demonstrations. Though permissible in data-rich scenarios, this may also lead to overfitting with a small $\mathcal{D}_E$. However, if we assume that the policy improves at each iteration, i.e. $\mathbb{E}_{\pi_{t+1}}[r(x,y)]\geq\mathbb{E}_{\pi_t}[r(x,y)]$, then we can also consider comparisons between policies during the course of learning. Unlike comparisons with the expert, we do not guarantee that this holds; in practice, however, we found that models tended to improve with each iteration, perhaps owing to the convexity of both reward modeling and Eq. 1. This lets us sample comparisons between the complete ranking of policies

$$\mathcal{D}_E\succeq\mathcal{D}_t\succeq\mathcal{D}_{t-1}\succeq\ldots\succeq\mathcal{D}_1\succeq\mathcal{D}_0. \tag{2}$$

The effect of adding these “intermodel” and “replay” comparisons is that the likelihoods of earlier samples (e.g., those in $\mathcal{D}_1$) are pushed down more than those of later samples (e.g., those in $\mathcal{D}_t$), smoothing the implicit reward landscape. Our practical implementation aggregates a handful of these intermodel comparisons in addition to comparisons with the expert.
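To make the ranking in Eq. 2 concrete, the following sketch enumerates the comparisons it implies for a single prompt. It is illustrative only: the function name `ranking_comparisons`, the list-of-rounds layout, and the exhaustive enumeration are assumptions; the keys follow the paper's terminology (expert versus newest samples, expert versus older samples, and newer versus older samples).

```python
from itertools import product
from typing import Dict, List, Tuple

Comparison = Tuple[str, str, str]   # (prompt x, preferred y^w, rejected y^l)


def ranking_comparisons(
    prompt: str,
    expert: str,
    rounds: List[List[str]],
) -> Dict[str, List[Comparison]]:
    """Enumerate pairs implied by D_E >= D_t >= D_{t-1} >= ... >= D_0.

    rounds[i] holds the completions sampled from policy pi_i for this prompt,
    so rounds[-1] is the newest dataset D_t.
    """
    t = len(rounds) - 1
    # Expert preferred over samples from the current policy.
    online = [(prompt, expert, y) for y in rounds[t]]
    # Expert preferred over samples from all earlier policies.
    replay = [(prompt, expert, y) for i in range(t) for y in rounds[i]]
    # Samples from a later policy preferred over samples from an earlier one.
    intermodel = [
        (prompt, y_new, y_old)
        for i in range(1, t + 1)
        for j in range(i)
        for y_new, y_old in product(rounds[i], rounds[j])
    ]
    return {"online": online, "replay": replay, "intermodel": intermodel}


pairs = ranking_comparisons(
    "Write a thank-you email.",
    "Hi Sam, thanks a ton!",
    rounds=[["draft 0a", "draft 0b"], ["draft 1a", "draft 1b"], ["draft 2a"]],
)
```

Samples from $\mathcal{D}_0$ appear as the rejected completion in the most pairs, which matches the observation that earlier samples are pushed down more than later ones.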

A Practical Algorithm.

In practice, the DITTO algorithm is an iterative procedure composed of three simple components, as outlined in Algorithm 1. First, we begin by running supervised fine-tuning on the set of expert demonstrations for a limited number of gradient steps. We set this to be the initial policy $\pi_0$. Second, we sample comparisons: at most $K$ times during the training process, we construct a new dataset $\mathcal{D}_t$ by sampling $M$ completions from $\pi_t$ for each of the $N$ demonstrations in $\mathcal{D}_E$ and add it to the ranking over policies (Eq. 2). When sampling comparisons from Eq. 2, each batch $B$ is composed of 70% “online” comparisons $\mathcal{D}_E\succeq\mathcal{D}_t$, 20% “replay” comparisons of the form $\mathcal{D}_E\succeq\mathcal{D}_{i<t}$, and 10% “intermodel” comparisons of the form $\mathcal{D}_{i\leq t}\succeq\mathcal{D}_{j<i}$. Finally, we update the policy using RLHF. Specifically, using batches sampled via the aforementioned procedure, we update the policy $\pi_t$ to obtain $\pi_{t+1}$ using the DPO [34] loss function

$$\mathcal{L}_{\text{DPO}}(\pi,\mathcal{D})=-\mathbb{E}_{(x,y^{w},y^{l})\sim\mathcal{D}}\left[\log\sigma\left(\alpha\log\tfrac{\pi(y^{w}|x)}{\pi_{\textrm{ref}}(y^{w}|x)}-\alpha\log\tfrac{\pi(y^{l}|x)}{\pi_{\textrm{ref}}(y^{l}|x)}\right)\right],$$

where $\sigma$ is the logistic function from the Bradley-Terry preference model. During each update, we do not update the reference model $\pi_{\textrm{ref}}$ from the SFT policy to avoid straying too far from initialization.
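The 70% / 20% / 10% batch composition described in this section can be sketched as below. Several details are assumptions rather than part of the paper: the function name `mix_batch`, sampling with replacement, rounding of the per-pool counts, and the fallback that gives a pool's share to online comparisons when that pool is still empty (as in the first iteration).

```python
import random
from typing import List, Tuple

Comparison = Tuple[str, str, str]   # (prompt x, preferred y^w, rejected y^l)


def mix_batch(
    online: List[Comparison],
    replay: List[Comparison],
    intermodel: List[Comparison],
    batch_size: int,
    rng: random.Random,
    replay_frac: float = 0.2,
    intermodel_frac: float = 0.1,
) -> List[Comparison]:
    """Draw a batch; online comparisons take whatever share is left (70% by default)."""
    if not online:
        raise ValueError("need at least one online comparison")
    # Assumption: an empty pool contributes nothing and its share goes to online.
    n_replay = round(batch_size * replay_frac) if replay else 0
    n_inter = round(batch_size * intermodel_frac) if intermodel else 0
    n_online = batch_size - n_replay - n_inter
    batch = rng.choices(online, k=n_online)          # sampling with replacement
    if n_replay:
        batch += rng.choices(replay, k=n_replay)
    if n_inter:
        batch += rng.choices(intermodel, k=n_inter)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
online = [("x", "expert", f"new{i}") for i in range(3)]
replay = [("x", "expert", f"old{i}") for i in range(3)]
inter = [("x", f"new{i}", f"old{i}") for i in range(3)]
batch = mix_batch(online, replay, inter, batch_size=10, rng=rng)
```

With `batch_size=10` this yields 7 online, 2 replay, and 1 intermodel comparison.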

3.3 Deriving DITTO as Online Imitation Learning

DITTO can be derived through an online imitation learning perspective, where expert demonstrations are used in conjunction with online data to simultaneously learn a reward function and policy. Specifically, the policy player maximizes expected reward $\max_{\pi}\mathcal{J}(\pi,r)$, as the reward player minimizes its loss $\min_{r}\mathcal{L}(\mathcal{D}^{\pi},r)$ over an online dataset $\mathcal{D}^{\pi}$. Concretely, we instantiate this optimization problem using the policy objective in Eq. 1 and the standard reward modeling loss

$$\min_{r}\left\{-\mathbb{E}_{(x,y^{w},y^{l})\sim\mathcal{D}^{\pi}}\left[\log\sigma\left(r(x,y^{w})-r(x,y^{l})\right)\right]\ \text{ s.t. }\ \pi=\arg\max_{\pi}\mathcal{J}_{\text{KL}}(\pi,r)\right\}. \tag{3}$$

As done in prior work [39], we take $\mathcal{D}^{\pi}$ to be a dataset of comparisons such that $y^{\pi}\succeq y^{\pi'}$ if $\mathbb{E}_{\pi}[r(x,y)]\geq\mathbb{E}_{\pi'}[r(x,y)]$. The $\pi$ superscript indicates that $\mathcal{D}^{\pi}$ contains online comparisons between $\pi$ and the expert $\pi_E$. By using different choices of regularizers and comparison data, one can arrive at different inverse RL (IRL) objectives [21].

Deriving DITTO.

The first step in simplifying Eq. 3 is addressing the inner policy maximization. Fortunately, from Ziebart [56] we know that the policy objective $\mathcal{J}_{\text{KL}}$ has a closed-form solution of the form $\pi^{\star}(y|x)=\pi_{\textrm{ref}}(y|x)e^{r(x,y)/\alpha}/Z(x)$ where $Z(x)$ is the partition function normalizing the distribution. Notably, this establishes a bijection between policies and reward functions which we can use to eliminate the inner optimization. By rearranging this solution, we can write the reward function $r$ as

$$r(x,y)=\alpha\log\tfrac{\pi^{\star}(y|x)}{\pi_{\textrm{ref}}(y|x)}+\alpha\log Z(x).$$
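The closed-form solution and its rearrangement can be checked numerically on a toy problem. The sketch below assumes a single prompt with three possible completions; the values of `alpha`, `pi_ref`, and `reward` are arbitrary illustrative choices. Taking logs of $\pi^{\star}=\pi_{\textrm{ref}}e^{r/\alpha}/Z$ gives $r=\alpha\log(\pi^{\star}/\pi_{\textrm{ref}})+\alpha\log Z$, which is what `recovered` computes.

```python
import math

alpha = 0.5
pi_ref = [0.5, 0.3, 0.2]   # reference distribution over 3 completions (illustrative)
reward = [1.0, 0.0, 2.0]   # hypothetical rewards r(x, y)

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / alpha) / Z(x)
unnorm = [p * math.exp(r / alpha) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]


def objective(pi):
    """J_KL(pi) = E_{y~pi}[ r(y) - alpha * log(pi(y) / pi_ref(y)) ]."""
    return sum(
        q * (r - alpha * math.log(q / p))
        for q, p, r in zip(pi, pi_ref, reward)
        if q > 0
    )


# Rearranged solution: r = alpha * log(pi* / pi_ref) + alpha * log Z
recovered = [
    alpha * math.log(q / p) + alpha * math.log(Z)
    for q, p in zip(pi_star, pi_ref)
]
```

In this toy case `recovered` matches `reward` up to floating-point error, and `objective(pi_star)` equals `alpha * math.log(Z)`, which exceeds `objective(pi_ref)`.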

Furthermore, prior work [35] shows that this reparameterization can express any reward function. Thus, we can perform a change of variables from $r$ to $\pi$ by substitution into Eq. 3, giving us the DITTO objective

$$\min_{\pi} -\mathbb{E}_{\mathcal{D}^{\pi}}\left[\log \sigma\left(\alpha \log \frac{\pi(y^{w}|x)}{\pi_{\textrm{ref}}(y^{w}|x)} - \alpha \log \frac{\pi(y^{l}|x)}{\pi_{\textrm{ref}}(y^{l}|x)}\right)\right].$$

Note that like DPO, we implicitly estimate the reward function. Unlike DPO, DITTO depends on an online dataset of preferences $\mathcal{D}^{\pi}$. At a minimum, the online preference dataset ought to contain comparisons $\pi_E \succeq \pi, \forall \pi$. However, any preferences consistent with the ground-truth reward function can additionally be used. We leave this exploration to future work.
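To make the objective concrete, here is a minimal sketch of the per-pair loss, assuming the sequence log-probabilities under $\pi$ and $\pi_{\textrm{ref}}$ have already been computed elsewhere. The function names and the default `alpha=0.1` are hypothetical choices for illustration; this is not the authors' training code.

```python
import math

def ditto_pair_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, alpha=0.1):
    """-log sigmoid(alpha * log-ratio(winner) - alpha * log-ratio(loser)).

    Inputs are summed token log-probabilities of the preferred (w) and
    rejected (l) responses under the policy and the reference model.
    """
    margin = alpha * (logp_w - logp_ref_w) - alpha * (logp_l - logp_ref_l)
    # -log(sigmoid(m)) equals softplus(-m); this form stays stable for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def ditto_batch_loss(pairs, alpha=0.1):
    """Mean loss over (logp_w, logp_ref_w, logp_l, logp_ref_l) tuples."""
    return sum(ditto_pair_loss(*p, alpha=alpha) for p in pairs) / len(pairs)

# Before any training the policy equals the reference, so the margin is 0
# and the loss is log(2).
print(ditto_pair_loss(-12.0, -12.0, -15.0, -15.0))
```

The loss falls below $\log 2$ once the policy raises the demonstration's likelihood relative to the reference by more than it raises the sampled negative's.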

Why does DITTO work better than SFT alone?

One reason for DITTO’s relatively high performance is that it uses far more data than SFT by generating comparisons. Another is that online imitation learning methods can, in some circumstances, perform better than the demonstrator while SFT only mimics the demonstrations. While this is known in the IRL community, we show the following result in Appendix A to relate DITTO’s ability to extrapolate beyond the demonstrator to two divergence measures.

Lemma 3.1.

(Adapted from Brown et al. [6]) Let $\pi^{\star}$ be the optimal policy for Eq. 1 and $\hat{\pi}$ be the policy estimated by DITTO using expert demonstrations $\mathcal{D}_E$. Extrapolation beyond the demonstrator, i.e. $\mathbb{E}_{\hat{\pi}}[r(x,y)] > \mathbb{E}_{\mathcal{D}_E}[r(x,y)]$, is guaranteed if $\mathcal{J}_{\text{KL}}(\pi^{\star}) - \mathbb{E}_{\mathcal{D}_E}[r(x,y)] > \alpha D_{\text{KL}}(\hat{\pi} \,||\, \pi^{\star}) - \alpha D_{\text{KL}}(\hat{\pi} \,||\, \pi_{\textrm{ref}})$.

4 Experiments

We first outline benchmarks, focusing on tasks with subjective preferences (e.g., email writing, essays, articles). We then discuss automatic evaluation, compare DITTO to several baselines, and outline results. Finally, we conduct a user study with DITTO, soliciting demonstrations from participants.

4.1 Static Benchmarks

Data

Measuring few-shot alignment with DITTO requires demonstrations from individuals instead of aggregated datasets. We therefore build on prior Author Attribution (AA) datasets. The AA task requires one to determine which author $a$ from a set of authors $A$ wrote a specific document. We can reframe prior AA classification tasks as effective alignment: aligning an LLM to a specific author should result in generations that are more likely to be attributed to the same author. We collect data from 20 distinct authors from two sources: (1) emails and blog posts from the CMCC dataset [17] that contain only one author and (2) news articles from the CCAT dataset [27]. For more dataset details, we refer the reader to Appendix 4.

Splits and Preprocessing

Some of our benchmarks have more writing samples per author than others. While the original CCAT can have more than 50 samples per author, CMCC can have as few as 12. To control for sample count, we randomly select the smallest set of demonstrations available from each author across our training splits (12) for our experiments. We randomly select 10 authors from each dataset, use 7 samples to train, and split the remainder into test and validation. Table 4 in the Appendix describes the finalized train/val/test counts across each benchmark.
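The per-author subsampling described above can be sketched as follows. The text fixes 12 samples per author with 7 for training; how the remaining 5 are divided between validation and test is reported in Table 4 and is not restated here, so the `n_val=2` default below is a placeholder assumption, as are the function name and the seed.

```python
import random

def split_author_samples(samples, n_total=12, n_train=7, n_val=2, seed=0):
    """Subsample n_total documents for one author, then split them.

    n_val is a hypothetical value; the remainder after train and val
    becomes the test split.
    """
    rng = random.Random(seed)
    chosen = rng.sample(samples, n_total)  # sampling without replacement
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test

docs = [f"doc{i}" for i in range(50)]
train, val, test = split_author_samples(docs)
print(len(train), len(val), len(test))
```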

Models and Baselines

Alongside DITTO, we evaluate continued supervised fine-tuning (SFT), where we simply fine-tune on the expert demonstrations $\mathcal{D}_E$. We also evaluate SPIN [10], an iterative self-play method designed to replace SFT. Finally, we test zero-shot and few-shot prompting, including demonstrations directly in the model’s context. For few-shot prompting, we add the train set of an author in-context. Our experiments require a base, instruction following LLM. We use Mistral Instruct v0.2 7B as a starting point [23] and train using LoRA [22]. Finally, we compare against zero/few-shot prompting with a more powerful LLM (GPT-4). Hyperparameter details are in Appendix C.

Automatic Evaluation

Given that our datasets contain a total of 20 authors, we must train and evaluate a large set of models (20 authors × 7 training paradigms = 140 models). To facilitate the evaluation process, we use GPT-4 to compare the outputs of models across various conditions. (We use the gpt-4-0613 version of GPT-4; we observed that Turbo versions of GPT-4 were more biased towards their own outputs. Queries were run between December 20th, 2023 and May 10th, 2024.) Prior work has used GPT to both annotate and evaluate text [53]. Performance lags behind human evaluation; however, GPT-4 eval generally outperforms other automatic metrics, allowing us to scale hyperparameter search and run evaluation in a more cost-effective manner.

In our setting, we use GPT-4 to determine if a text sounds more or less like a specific author. Given an author-written text $t$ and two pairs of generated text from different conditions $a$ and $b$, we prompt GPT-4 to select the text that most closely matches the validation or test text $t$, and compute averaged head-to-head win rates. Prompting details are outlined in Appendix D.
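One way to aggregate such pairwise judgments into per-method win rates is sketched below. It assumes each judgment is recorded as a `(condition_a, condition_b, winner)` tuple and that a method's win rate is the share of its comparisons that it wins; the exact aggregation used in the paper may differ, and the example judgments are made up.

```python
from collections import defaultdict

def head_to_head_win_rates(judgments):
    """Percentage of comparisons each method wins.

    judgments: iterable of (method_a, method_b, winner) tuples, where
    winner is whichever of the two methods the judge selected.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, winner in judgments:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")
        games[a] += 1
        games[b] += 1
        wins[winner] += 1
    return {m: 100.0 * wins[m] / games[m] for m in games}

# Hypothetical judgments, for illustration only.
judgments = [
    ("DITTO", "SFT", "DITTO"),
    ("DITTO", "few-shot", "DITTO"),
    ("SFT", "few-shot", "few-shot"),
    ("DITTO", "SFT", "SFT"),
]
rates = head_to_head_win_rates(judgments)
print(rates)
```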

| Data | Model | Method | $a_{\mathrm{avg}}$ | $a_1$ | $a_2$ | $a_3$ | $a_4$ | $a_5$ | $a_6$ | $a_7$ | $a_8$ | $a_9$ | $a_{10}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CMCC | GPT | zero-shot | $31.89_{3.05}$ | 43.06 | 29.17 | 22.22 | 37.04 | 18.52 | 42.59 | 19.44 | 40.28 | 40.28 | 31.48 |
| CMCC | GPT | few-shot | $63.89_{3.18}$ | $\mathbf{73.61}$ | 68.06 | 62.50 | 62.04 | 55.56 | 64.81 | $\mathbf{75.93}$ | $\mathbf{63.89}$ | 40.28 | 68.52 |
| CMCC | Mistral | zero-shot | $27.33_{2.24}$ | 34.72 | 30.56 | 16.67 | 29.63 | 27.78 | 30.56 | 19.44 | 38.89 | 19.44 | 26.85 |
| CMCC | Mistral | few-shot | $46.89_{4.76}$ | 61.11 | $\mathbf{76.39}$ | 26.39 | 30.56 | 42.59 | 52.78 | 37.04 | 41.67 | 54.17 | 54.63 |
| CMCC | Mistral | SPIN | $51.56_{3.85}$ | 56.94 | 48.61 | 56.94 | 40.74 | 73.15 | 48.15 | 59.26 | 59.72 | 31.94 | 38.89 |
| CMCC | Mistral | SFT | $56.78_{7.04}$ | 18.06 | 27.78 | $\mathbf{86.11}$ | 74.07 | 58.33 | 43.52 | 64.81 | 47.22 | 81.94 | 58.33 |
| CMCC | Mistral | DITTO | $\mathbf{71.67_{2.30}}$ | 62.50 | 69.44 | 79.17 | $\mathbf{75.93}$ | $\mathbf{74.07}$ | $\mathbf{67.59}$ | 74.07 | 58.33 | $\mathbf{81.94}$ | $\mathbf{71.30}$ |
| CCAT | GPT | zero-shot | $19.35_{1.40}$ | 19.44 | 24.07 | 25.00 | 18.52 | 12.96 | 20.37 | 12.04 | 23.15 | 16.67 | 21.30 |
| CCAT | GPT | few-shot | $53.70_{2.19}$ | 64.81 | 53.70 | 61.11 | 53.70 | 47.22 | 44.44 | 45.37 | 61.11 | 52.78 | 52.78 |
| CCAT | Mistral | zero-shot | $18.06_{1.61}$ | 13.89 | 23.15 | 15.74 | 12.96 | 13.89 | 22.22 | 17.59 | 14.81 | 28.70 | 17.59 |
| CCAT | Mistral | few-shot | $40.37_{2.33}$ | 56.48 | 45.37 | 35.19 | 32.41 | 41.67 | 39.81 | 46.30 | 35.19 | 34.26 | 37.04 |
| CCAT | Mistral | SPIN | $62.13_{3.11}$ | 56.48 | 69.44 | 55.56 | $\mathbf{82.41}$ | 70.37 | 54.63 | 58.33 | 54.63 | 51.85 | 67.59 |
| CCAT | Mistral | SFT | $73.89_{2.50}$ | 61.11 | 62.04 | 76.85 | 72.22 | 80.56 | 81.48 | 80.56 | 68.52 | 82.41 | 73.15 |
| CCAT | Mistral | DITTO | $\mathbf{82.50_{1.93}}$ | $\mathbf{77.78}$ | $\mathbf{72.22}$ | $\mathbf{80.56}$ | 77.78 | $\mathbf{83.33}$ | $\mathbf{87.04}$ | $\mathbf{89.81}$ | $\mathbf{92.59}$ | $\mathbf{83.33}$ | $\mathbf{80.56}$ |

Table 1: GPT-4 Eval: Head-to-head win rates between methods across benchmark test splits. DITTO outperforms all baseline methods on average and across a plurality of individual authors. $a_1 \dots a_{10}$ represents a single model trained on one of ten sampled authors from each dataset (see §4). Results are averaged across 3 runs, with 3 samples generated from each model with temperature 1.0. We also report win rates averaged across authors, along with standard error of the mean ($\mathrm{avg}_{\mathrm{sem}}$).
Results

Our main results, evaluated with GPT-4 eval, are summarized in Table 1. Averaged across all authors, DITTO outperforms all baselines, with an average 77.09% win-rate across both CMCC (71.67%) and CCAT50 (82.50%). On CCAT50, DITTO outperforms all baselines across authors but one. On CMCC, DITTO outperforms all other baselines for 5/10 authors, followed by few-shot prompting for 3/10. While SFT serves as a strong baseline (56.78% on CMCC, 73.89% on CCAT), DITTO provides an average $\uparrow$11.7% pt. win rate improvement compared to SFT alone.
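The headline averages for DITTO and SFT can be reproduced from the per-dataset averages in Table 1 with a few lines of arithmetic. The sketch below assumes the overall figure is an unweighted mean over the two datasets, which is consistent with the numbers quoted; the dictionary layout and function name are illustrative.

```python
# Per-dataset average win rates (%) for two methods, copied from Table 1.
table1_avg = {
    "DITTO": {"CMCC": 71.67, "CCAT": 82.50},
    "SFT": {"CMCC": 56.78, "CCAT": 73.89},
}

def mean_over_datasets(method):
    """Unweighted mean of a method's average win rate across datasets."""
    scores = table1_avg[method]
    return sum(scores.values()) / len(scores)

ditto_avg = mean_over_datasets("DITTO")  # 77.085, reported as 77.09%
sft_avg = mean_over_datasets("SFT")      # 65.335
gain = ditto_avg - sft_avg               # 11.75 points over SFT
print(ditto_avg, sft_avg, gain)
```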

Prompted baselines also lag far behind DITTO, especially zero-shot (including closed-source) models (avg. $\downarrow$54.4% pt. decrease on Mistral, $\downarrow$51.5% pt. on GPT-4). While zero-shot GPT-4 is already finetuned using RLHF, we suspect that this training feedback differs significantly from that of authors in both CMCC and CCAT50. Adding few-shot examples to the prompt does help: win rates for few-shot prompting increase compared to zero-shot for both Mistral ($\uparrow$20.94% pt.) and GPT-4 ($\uparrow$22.95% pt.) based LLMs. However, including few-shot examples still falls behind applying DITTO (avg. $\downarrow$37.35% pt. decrease for Mistral; $\downarrow$26.99% pt. for GPT-4). We suspect the underlying RLHF priors for out-of-the-box LLMs are fairly strong. Qualitatively, few-shot generations still sound GPT-generated relative to DITTO (Table 6 in Appendix).

While we do test another self-improvement training method (SFT + SPIN), we find that performance is lower than DITTO (avg. $\downarrow$9.3% pt.). We suspect that design decisions for SPIN (e.g., updating the reference policy, excluding interpolicy / replay comparisons) are targeted towards SFT-scale datasets. We ablate these decisions in §5.1 and propose reasons for performance degradation.

4.2 User Study: Testing Generalization to Naturalistic Tasks

Our static benchmarks have focused on pre-existing author attribution datasets, using GPT-4 to measure alignment. However, GPT-4 eval exhibits a self-enhancement bias, likely inflating performance for LLM-like generations [53, 33]. We therefore evaluate DITTO in a more naturalistic setting; we conduct a user study to evaluate DITTO and ask users to provide demonstrations for a range of tasks. As baselines, we use zero-shot and few-shot prompted GPT-4, along with SFT. Additionally, we ask participants to self-prompt models by iteratively authoring their own prompts to steer the model outputs. Zero-shot, few-shot, and self-prompt emulate what most users would do today to steer LLMs, and SFT provides a strong finetuning baseline. We recruit 16 participants from social media postings (Twitter). Participants were paid $30 per hour, and our study was approved by our institution’s IRB.

| Method | Win Rate |
|---|---|
| GPT-4 zero-shot | 25.0 |
| GPT-4 few-shot | 48.1 |
| GPT-4 self-prompt | 44.2 |
| SFT | 60.1 |
| DITTO | 72.1 |

Table 2: User Study Results. In head-to-head human annotated win rates, DITTO outperforms self-prompted, few-shot, and zero-shot GPT-4 baselines, along with SFT.
User Study Outline

The user study consists of two parts. In the first part, we ask participants to specify four email-writing tasks (e.g., Write an email to your advisor asking for feedback). Participants are asked to provide two demonstrations for two of the tasks (4 training demonstrations in total). To help brainstorm tasks, we generate concrete task suggestions with GPT-4; participants could select from among these or provide their own custom tasks. We randomly split two task prompts into train, and saved two for testing; participants gave two demonstrations each for both the training prompts, to mimic a user willing to only put in minimal effort. Users were provided with default generations from GPT-4 to aid authoring demonstrations, which they could edit or ignore. In the second part, we use the two tasks from the test set and show participants generations across all methods. We sampled one output from each method (self-prompt, zero-shot, few-shot, SFT, and DITTO), and solicited 10 pairwise preferences for each test prompt (resulting in 20 preferences total for each user). In all, we collect a total of 320 pairwise preferences across 16 users. Additional user study details (e.g., interface, examples of demonstrated feedback, prompts for generating tasks, etc.) are in Appendix E.
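The preference counts above follow from simple counting: five methods give ten unordered pairs per test prompt. The sketch below assumes the ten preferences per prompt are exactly these pairs, which matches the totals reported; the variable names are illustrative.

```python
from itertools import combinations

METHODS = ["self-prompt", "zero-shot", "few-shot", "SFT", "DITTO"]
N_TEST_PROMPTS = 2
N_USERS = 16

def comparison_pairs(methods):
    """All unordered pairs of methods shown head-to-head for one prompt."""
    return list(combinations(methods, 2))

pairs_per_prompt = len(comparison_pairs(METHODS))  # 10
per_user = pairs_per_prompt * N_TEST_PROMPTS       # 20
total = per_user * N_USERS                         # 320
print(pairs_per_prompt, per_user, total)
```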

Results

Our user study results corroborate findings from static benchmarks. DITTO outperforms baseline methods in aligning to demonstrated preferences (Table 2), with DITTO (72.1% win-rate) > SFT (60.1%) > few-shot (48.1%) > self-prompt (44.2%) > zero-shot (25.0%). Additionally, users generally struggle with verbalizing preferences into prompts: self-prompting slightly underperforms providing demonstrations in a few-shot prompt, and substantially underperforms DITTO. We also qualitatively observe that users often edit nearly half of the default output from GPT-4 when authoring demonstrations (examples in Appendix E), with average normalized Levenshtein edit distance = 0.43. Large edits to the output alone highlight the effectiveness of demonstrated feedback as an interaction.
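For reference, a normalized Levenshtein distance between a default generation and a user's edited demonstration can be computed as in the sketch below. The normalization (dividing by the length of the longer string) and the character-level granularity are assumptions; the text does not specify which variant was used.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest

print(normalized_edit_distance("Dear Professor,", "Hi Prof,"))
```

Under this normalization, a value near 0.43 means that a little under half of the characters in the longer text would need to change to turn one text into the other.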

5 When does DITTO work?

A user must decide on several prerequisites before using DITTO, from how many demos they have to how many negatives they must sample from the LM. We explore the impact of these decisions and focus on CMCC, as it covers a broader range of tasks than CCAT. We additionally analyze the sample efficiency of demonstrations vs. pairwise feedback in our user study setting.

5.1 Algorithm Perturbations

DITTO consists of several hyperparameters: namely, the number of DITTO iterations $N = \{1, \dots, 4\}$ and negative samples $M = \{2, \dots, 10\}$ generated from our sequence of policies. Separately, we ablate components of DITTO, like the use of inter-policy ($\mathcal{D}_{i \leq t} \succeq \mathcal{D}_{j<i}$) and replay ($\mathcal{D}_E \succeq \mathcal{D}_{i<t}$) comparisons. We also test an ablation where we do not re-sample data during training and instead cumulatively sample all negatives at the start, and where we update $\pi_{\textrm{ref}} = \pi_t$ at each iteration like SPIN [10]. We then reevaluate DITTO against ablations and hyperparameter perturbations by computing head-to-head win rates with GPT-4 (similar to §4.1). Because DITTO performance varies from user to user, we convert win rates to % improvement from the first ablation across authors.
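The three kinds of comparisons can be illustrated with the sketch below, which enumerates (prompt, preferred, rejected) triples from a set of demonstrations and the negatives sampled at each iteration. Enumerating every pair, rather than sampling a batch from them, is a simplifying assumption, and the data layout and function name are hypothetical rather than taken from the released code.

```python
def build_comparisons(demos, samples_by_iter, use_replay=True, use_interpolicy=True):
    """Enumerate (prompt, preferred, rejected) triples.

    demos: dict mapping prompt -> expert demonstration.
    samples_by_iter: list of dicts (prompt -> list of sampled responses),
        one per policy iteration; the last entry is the current policy.
    """
    t = len(samples_by_iter) - 1
    comparisons = []
    for x, y_expert in demos.items():
        # Online: expert demonstration preferred over current-policy samples.
        for y in samples_by_iter[t].get(x, []):
            comparisons.append((x, y_expert, y))
        if use_replay:
            # Replay: expert preferred over samples from earlier policies.
            for i in range(t):
                for y in samples_by_iter[i].get(x, []):
                    comparisons.append((x, y_expert, y))
        if use_interpolicy:
            # Inter-policy: later-policy samples preferred over earlier ones.
            for i in range(1, t + 1):
                for j in range(i):
                    for y_new in samples_by_iter[i].get(x, []):
                        for y_old in samples_by_iter[j].get(x, []):
                            comparisons.append((x, y_new, y_old))
    return comparisons

demos = {"p": "expert email"}
samples = [{"p": ["a0", "b0"]}, {"p": ["a1"]}]
full = build_comparisons(demos, samples)
print(len(full))
```

Setting `use_replay=False` or `use_interpolicy=False` corresponds to the "remove replay" and "remove interpolicy" ablations.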

Figure 2: Head-to-head win rates across DITTO hyperparameter perturbations on CMCC. First, increasing the number of DITTO iterations improves GPT-4 eval performance (left). Increasing the number of generated negatives also reduces DITTO variance across users while improving DITTO performance (middle). Finally, increasing demos also improves performance, but we observe diminishing returns (right). Error bars correspond to standard error of the mean across authors.

| Ablation | Win Rate |
|---|---|
| Cumulative sample | 57.3 |
| DITTO | 70.1 |
| $\rightarrow$ remove interpolicy | 68.1 |
| $\rightarrow$ remove replay | 63.6 |
| $\rightarrow$ update $\pi_{\textrm{ref}}$ | 45.8 |

Table 3: Head-to-head win rates across DITTO algorithm ablations on CMCC. We experiment with sampling all negatives upfront (Cumulative Sample), ablating replay and interpolicy comparisons, and updating the reference policy.

Increasing the number of DITTO iterations generally improves performance (Fig. 2). Comparing Iteration 1 to Iteration 4, we observe a relative 31.5% increase in GPT-4 eval win rates. Improvement is non-monotonic: in Iteration 2, performance drops slightly (-3.4%). Early iterations might yield noisier samples, potentially reducing performance. On the other hand, increasing negative samples monotonically improves DITTO performance. Generating 10 negatives for each demonstration in the training set, for example, yields a 21.09% win-rate improvement compared to just 2. Furthermore, as we sample more negatives, variance in DITTO performance decreases. However, there is a tradeoff associated with increasing the number of negative samples: runtime of DITTO will also increase. We also hypothesize that increasing iterations, negative samples, etc., will yield diminishing returns; exploring these tradeoffs is an avenue for future work.

We also find that ablating components of DITTO results in reduced performance (Table 3). If we sample all negatives at the start, instead of iteratively resampling in an online fashion, we observe that win rates compared to using DITTO drop from 70.1% to 57.3%. While iteratively re-sampling improves performance, continuously updating $\pi_{\textrm{ref}}$ during this online process can significantly degrade performance: win rates drop from 70.1% to 45.8%. We suspect updating $\pi_{\textrm{ref}}$ results in potential overfitting. Finally, both replay and inter-policy comparisons help DITTO. Removing replay and interpolicy comparisons reduces win rates from DITTO by 6.5 and 2 points respectively.

5.2 Sample Efficiency

A key affordance of DITTO is its sample efficiency. In §4, we examined DITTO’s performance on the full training set of 7 demonstrations from each author. In practice, a user may only provide one or two demonstrations. Therefore, we evaluate sample efficiency across DITTO trained on smaller subsets of the full training corpus, $N = \{1, \dots, 7\}$. Like with our algorithm perturbations, we report per-user normalized win rates (Figure 2). First, we observe that DITTO win rates increase rapidly at the start. From $1 \leq N \leq 3$, normalized performance roughly doubles for each additional demonstration ($0\% \rightarrow 5\% \rightarrow 11.9\%$). However, we observe diminishing returns when supplying extra demonstrations ($4 \leq N \leq 7$, $11.9\% \rightarrow 15.39\%$), indicating that DITTO performance saturates as demonstrations increase. A key design decision in using DITTO lies in the selection of demonstrations; we additionally suspect that the quality of provided demonstrations also affects DITTO performance. Understanding how to select an optimal set of demonstrations for DITTO from a user is an avenue for future work.

Figure 3: Demonstrations are more sample efficient than pairwise preferences for an individual user. We compared DITTO with 4 demos to pairwise prefs sampled from (1) base instruction-following LM $\pi_{\textrm{ref}}$ and (2) $\pi_{\textrm{ref}}$ fine-tuned on demos. Applying DPO on 500 pairwise preferences (with samples from $\pi_{\textrm{ref}}$) yields no improvement compared to DITTO. Even if demos are used to fine-tune $\pi_{\textrm{ref}}$ before sampling, one must collect many pairwise preferences to approach DITTO.

5.3 How do pairwise preferences compare against demonstrations?

A core assumption of DITTO lies in sample efficiency coming from demonstrations. In theory, a user could achieve similar performance by labeling many pairwise preferences with an ideal set of demonstrations in mind. As a preliminary approximation, one author provided demonstrations for the user study and also annotated 500 preference pairs using outputs sampled from the instruction-following Mistral 7B (demonstrations in Appendix E.4). Altogether, we constructed a pairwise preferences dataset $\mathcal{D}_{pref} = \{(x, y^{i}, y^{j})\}$, where $y^{i} \succ y^{j}$. We then computed win rates between 20 pairs sampled from Mistral trained on (a) 4 demonstrations with DITTO and (b) $\{0, \dots, 500\}$ preference pairs with just DPO. When we sample pairwise preferences from $\pi_{\textrm{ref}}$ alone, we observe that generated pairs are out-of-distribution relative to the demonstrations: pairwise preferences do not reach a user’s demonstrated behavior (results in Fig. 3: “Base policy,” in blue). Even when we finetune $\pi_{\textrm{ref}}$ on the user’s demonstrations, we still need $> 500$ preferences to match DITTO performance (Fig. 3: “Demo-finetuned policy,” in orange).

6 Conclusion

Current modes for soliciting feedback—like principles or pairwise annotations—cater to population-level preferences. In this work, we instead highlight the effectiveness of using demonstrations as feedback, and show that a limited number of demonstrated behaviors can provide a strong signal for preferences specific to an individual. We also introduce a new technique, DITTO, that cheaply generates online comparison data from demonstrations, and test DITTO’s effectiveness across static benchmarks and a user study. Focusing feedback collection at the demonstration level may offer a more diverse overview of individual preferences, and encourage a re-evaluation of the interfaces and interactions used to collect human feedback.

Limitations

First, we noticed a discrepancy between GPT eval and our user study (Appendix D). While we do mitigate this effect with a user study, we still note that GPT eval is biased toward LLM-generated results. Furthermore, GPT eval is highly sensitive to the prompt used for evaluation. Another limitation involves DITTO speed: DITTO is slower than training-free approaches (prompting) and SFT (15 minutes with DITTO vs. 2 minutes with SFT on 7 demonstrations). A bottleneck lies in sampling, though we suspect a mix of prior (e.g., vLLM [25]) and future work in LLM inference optimization can improve DITTO’s speed. DITTO is also uninterpretable. It is unclear exactly what a model learns after several iterations: do values shift too, or is it just style? We also suspect that forgetting may affect DITTO. Even with LoRA, models DITTO-ed on writing sometimes refuse to generate code. Related work on overgeneralization might mitigate these effects [40]. Because of evaluation and computational constraints, we also do not test other model families or sizes. Finally, analyzing tradeoffs between types of preference data (e.g., demonstrations vs. preferences vs. principles) requires additional analysis; each type of feedback requires different levels of effort, and the effectiveness depends on the user providing feedback. We leave this exploration for future work.

Broader Impacts

Demonstrated feedback is a double-edged sword. While DITTO can enable effective personalization of language models, we also suspect that DITTO will be especially useful for model un-alignment, amongst a range of other risks [24]. However, the current status quo of language model alignment lies with large corporations that practice limited transparency. Models like GPT-4 already espouse dangerous positive stereotypes or unfairly benefit privileged groups due to representation issues in the feedback collection process [11, 36].

7 Acknowledgements

We thank Eric Zelikman, Matt Jörke, Jan-Philipp Fränken, Michael Y. Li, Michael Ryan, Will Held, Shan Rizvi, Suvir Mirchandani, and Jensen Gao for helpful discussions and feedback. We also thank members of the SALT Lab and the Stanford HCI / NLP groups.

References

  • Akrour et al. [2012] R. Akrour, M. Schoenauer, and M. Sebag. April: Active preference learning-based reinforcement learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part II 23, pages 116–131. Springer, 2012.
  • Andukuri et al. [2024] C. Andukuri, J.-P. Fränken, T. Gerstenberg, and N. D. Goodman. Star-gate: Teaching language models to ask clarifying questions. arXiv preprint arXiv:2403.19154, 2024.
  • Azar et al. [2023] M. G. Azar, M. Rowland, B. Piot, D. Guo, D. Calandriello, M. Valko, and R. Munos. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  • Bai et al. [2022] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  • Brown et al. [2019] D. Brown, W. Goo, P. Nagarajan, and S. Niekum. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783–792. PMLR, 2019.
  • Brown et al. [2020a] D. S. Brown, W. Goo, and S. Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on robot learning, pages 330–359. PMLR, 2020a.
  • Brown et al. [2020b] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020b.
  • Burns et al. [2023] C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
  • Chakrabarty et al. [2023] T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, and C.-S. Wu. Art or artifice? large language models and the false promise of creativity. arXiv preprint arXiv:2309.14556, 2023.
  • Chen et al. [2024] Z. Chen, Y. Deng, H. Yuan, K. Ji, and Q. Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  • Cheng et al. [2023] M. Cheng, E. Durmus, and D. Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models. arXiv preprint arXiv:2305.18189, 2023.
  • Chung et al. [2022] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • Cypher [1991] A. Cypher. Eager: Programming repetitive tasks by example. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 33–39, 1991.
  • Cypher and Halbert [1993] A. Cypher and D. C. Halbert. Watch what I do: programming by demonstration. MIT press, 1993.
  • Fürnkranz et al. [2012] J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89:123–156, 2012.
  • Gao et al. [2024] G. Gao, A. Taymanov, E. Salinas, P. Mineiro, and D. Misra. Aligning llm agents by learning latent preference from user edits. arXiv preprint arXiv:2404.15269, 2024.
  • Goldstein et al. [2008] J. Goldstein, K. Goodwin, R. Sabin, and R. Winder. Creating and using a correlated corpora to glean communicative commonalities. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), 2008.
  • Gulcehre et al. [2023] C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  • Hejna et al. [2024] J. Hejna, R. Rafailov, H. Sikchi, C. Finn, S. Niekum, W. B. Knox, and D. Sadigh. Contrastive preference learning: Learning from human feedback without reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=iX1RjVQODj.
  • Hewitt et al. [2024] J. Hewitt, S. Chen, L. L. Xie, E. Adams, P. Liang, and C. D. Manning. Model editing with canonical examples. arXiv preprint arXiv:2402.06155, 2024.
  • Ho and Ermon [2016] J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
  • Hu et al. [2021] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • Jiang et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  • Kirk et al. [2023] H. R. Kirk, B. Vidgen, P. Röttger, and S. A. Hale. Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback. arXiv preprint arXiv:2303.05453, 2023.
  • Kwon et al. [2023] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  • Lee et al. [2023] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Carbune, and A. Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  • Lewis et al. [2004] D. D. Lewis, Y. Yang, T. Russell-Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.
  • Mishra et al. [2022] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, 2022.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Nakano et al. [2021] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  • Ng et al. [1999] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Icml, volume 99, pages 278–287, 1999.
  • Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Panickssery et al. [2024] A. Panickssery, S. R. Bowman, and S. Feng. Llm evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076, 2024.
  • Rafailov et al. [2023] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Rafailov et al. [2024] R. Rafailov, J. Hejna, R. Park, and C. Finn. From $r$ to $q^{*}$: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024.
  • Ryan et al. [2024] M. J. Ryan, W. Held, and D. Yang. Unintended impacts of llm alignment on global representation. arXiv preprint arXiv:2402.15018, 2024.
  • Santurkar et al. [2023] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto. Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548, 2023.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sikchi et al. [2022] H. Sikchi, A. Saran, W. Goo, and S. Niekum. A ranking game for imitation learning, 2022. URL https://openreview.net/forum?id=I59qJ0sJ2nh.
  • Stephan et al. [2024] M. Stephan, A. Khazatsky, E. Mitchell, A. S. Chen, S. Hsu, A. Sharma, and C. Finn. Rlvf: Learning from verbal feedback without overgeneralization. arXiv preprint arXiv:2402.10893, 2024.
  • Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. Christiano. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325, 2020.
  • Thoppilan et al. [2022] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • Watson et al. [2023] J. Watson, S. H. Huang, and N. Heess. Coherent soft imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Williams [1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
  • Wu and Hu [2018] Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  • Wu et al. [2024] Z. Wu, Y. Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, and H. Hajishirzi. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36, 2024.
  • Yuan et al. [2024] W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu, and J. Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  • Zamfirescu-Pereira et al. [2023] J. Zamfirescu-Pereira, R. Y. Wong, B. Hartmann, and Q. Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–21, 2023.
  • Zelikman et al. [2022] E. Zelikman, Y. Wu, J. Mu, and N. Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
  • Zelikman et al. [2024] E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman. Quiet-star: Language models can teach themselves to think before speaking, 2024.
  • Zhao et al. [2023] S. Zhao, J. Dang, and A. Grover. Group preference optimization: Few-shot alignment of large language models. arXiv preprint arXiv:2310.11523, 2023.
  • Zheng et al. [2024] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. [2024] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. [2022] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.
  • Ziebart [2010] B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.
  • Ziebart et al. [2008] B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
  • Ziegler et al. [2019] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix A Deriving DITTO as Online Imitation Learning

For understanding the provided derivations, it is helpful to be familiar with the fixed point solution for Eq. 1, which was first derived for maximum entropy RL [56].

\begin{align}
Q^{*}(x,y) &= r(x,y) \quad \text{(because contextual bandit)} \\
V^{*}(x) &= \alpha\log\mathbb{E}_{y\sim\pi_{\textrm{ref}}(\cdot|x)}\left[e^{r(x,y)/\alpha}\right] \\
\pi^{*}(y|x) &= \pi_{\textrm{ref}}(y|x)\,e^{(r(x,y)-V^{*}(x))/\alpha} = \frac{1}{Z(x)}\,\pi_{\textrm{ref}}(y|x)\,e^{r(x,y)/\alpha}
\end{align}

where $Z(x)=e^{V^{*}(x)/\alpha}=\mathbb{E}_{y\sim\pi_{\textrm{ref}}(\cdot|x)}\left[e^{r(x,y)/\alpha}\right]$. Using this information, in conjunction with Equation 1, we can derive a number of useful inequalities between $\pi^{*}$, $\pi_{\textrm{ref}}$, and an arbitrary $\pi$.
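To make the fixed point concrete, the following minimal sketch evaluates it for a single prompt with three possible completions. The reference probabilities, rewards, and value of $\alpha$ are hypothetical toy numbers chosen for illustration, and the function names are not taken from the DITTO codebase.

```python
import math

def soft_optimal_policy(pi_ref, rewards, alpha):
    """Return (V*, pi*) for one prompt with a finite set of completions."""
    # Z(x) = E_{y ~ pi_ref}[exp(r(x, y) / alpha)]
    z = sum(p * math.exp(r / alpha) for p, r in zip(pi_ref, rewards))
    v_star = alpha * math.log(z)
    # pi*(y|x) = pi_ref(y|x) * exp((r(x, y) - V*(x)) / alpha)
    pi_star = [p * math.exp((r - v_star) / alpha) for p, r in zip(pi_ref, rewards)]
    return v_star, pi_star

def kl_objective(pi, pi_ref, rewards, alpha):
    """J_KL(pi) = E_pi[r] - alpha * KL(pi || pi_ref) for one prompt."""
    return sum(q * (r - alpha * math.log(q / p))
               for q, p, r in zip(pi, pi_ref, rewards) if q > 0)

# Hypothetical toy problem: three completions for a single prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
alpha = 0.5

v_star, pi_star = soft_optimal_policy(pi_ref, rewards, alpha)
print("V*:", v_star)
print("pi*:", pi_star)
print("J_KL(pi*):", kl_objective(pi_star, pi_ref, rewards, alpha))
```

In this toy setting $\pi^{*}$ sums to one and its objective value coincides with $V^{*}(x)$, which is the identity $\mathcal{J}(\pi^{*})=\mathbb{E}_{x\sim p}[V^{*}(x)]$ that the proofs in this appendix rely on.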

A.1 Deriving DITTO

Here we provide a more detailed derivation of DITTO from an online imitation learning perspective. In particular, we consider the common two-player min-max interpretation of imitation learning [57, 39], but do so with general objective functions.

$$\min_{r}\mathcal{L}(\mathcal{D}^{\pi},r)\quad\max_{\pi}\mathcal{J}(\pi,r)$$

In this formulation, $\mathcal{D}^{\pi}$ is a dataset of preferences such that $y^{\pi}\succeq y^{\pi'}\,|\,x$ if $\mathbb{E}_{\pi}[r(x,y)]\geq\mathbb{E}_{\pi'}[r(x,y)]$, i.e. one completion is preferred to another if the corresponding policy has higher expected reward. This framework generalizes prior work. For example, if we limit ourselves to only comparing the expert policy $\pi_{E}$ to the current policy $\pi$, and add a regularizer, we can obtain the maximum entropy IRL objective from Ho and Ermon [21]. Choosing $\mathcal{J}_{\text{KL}}$ as the policy objective function and maximum likelihood on the Bradley-Terry model as the reward objective, we get the following optimization:

$$\min_{r}-\mathbb{E}_{(x,y^{w},y^{l})\sim\mathcal{D}_{\pi}}\left[\log\sigma(r(x,y^{w})-r(x,y^{l}))\right],\quad\max_{\pi}\mathcal{J}_{\text{KL}}(\pi,r)$$

where $\mathcal{J}_{\text{KL}}$ is the KL-constrained RL objective from before, but now dependent on the learned reward function. We then select an ordering for the optimization by making policy learning the “inner” objective, as done in Ho and Ermon [21]. Sikchi et al. [39] make connections between this choice and game theory. This results in the same equation as in the main paper, repeated here for clarity.

$$\min_{r}\left\{-\mathbb{E}_{(x,y^{w},y^{l})\sim\mathcal{D}_{\pi}}\left[\log\sigma(r(x,y^{w})-r(x,y^{l}))\right]\ \text{ s.t. }\ \pi=\arg\max_{\pi}\mathcal{J}_{\text{KL}}(\pi,r)\right\}.$$

We can then re-arrange the fixed point equations from maximum entropy RL, obtaining the “DPO-trick”:

$$r(x,y)=\alpha\log\frac{\pi^{*}(y|x)}{\pi_{\textrm{ref}}(y|x)}+\alpha\log Z(x).$$

This alone, however, is insufficient to obtain a representation for the optimal policy, as naively substituting the above does not guarantee that the domain of reward functions can be fully expressed by such a reparameterization in terms of the policy. Fortunately, prior work has established both that such a reparameterization is equally expressive [43, 31] and that it does not affect the preference model [19, 35]. Completing this substitution yields the main DITTO objective.
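The effect of the substitution on a single comparison can be sketched as follows. This is an illustrative, DPO-style pairwise loss under the assumption that sequence log-probabilities under the policy and the reference model are already available as plain floats; the function names and the example numbers are hypothetical and are not taken from the released code.

```python
import math

def implicit_reward(logp_policy, logp_ref, alpha):
    # Reparameterized reward alpha * log(pi(y|x) / pi_ref(y|x)).
    # The remaining term depends only on the prompt x (through Z(x)),
    # so it cancels when two completions of the same prompt are compared.
    return alpha * (logp_policy - logp_ref)

def pairwise_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, alpha):
    """Bradley-Terry negative log-likelihood for one (winner, loser) pair."""
    margin = (implicit_reward(logp_w, logp_ref_w, alpha)
              - implicit_reward(logp_l, logp_ref_l, alpha))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: a demonstration (winner) versus a
# completion sampled from the current policy (loser).
loss = pairwise_loss(logp_w=-4.0, logp_ref_w=-5.0,
                     logp_l=-8.0, logp_ref_l=-7.0, alpha=0.1)
print(loss)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; the loss decreases as the policy assigns relatively more probability to the preferred completion than the reference does.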

However, DITTO is compatible with other algorithms, such as traditional RL methods, so long as they can be used to solve for the KL-constrained RL objective in Eq. 1. Instead of using the DPO trick, one could use a few steps of a policy gradient algorithm to update the policy.

Distributional versus Point-wise Preferences. One thing to note is that we construct preferences for DITTO from distributional preferences, i.e. $\mathbb{E}_{\pi_{1}}[r(x,y)]\geq\mathbb{E}_{\pi_{2}}[r(x,y)]$. However, this only guarantees that completions from one policy are preferred to those from another in expectation, not necessarily that every realized preference pair follows this relationship. We found that this choice works well in practice, and it is actually common in prior work. For example, Brown et al. [6] use a sequence of policies ranked by expected return in combination with a Bradley-Terry model. Appendix C of Stephan et al. [40] shows that artificially sampling comparisons between two policies is consistent with a Bradley-Terry reward model. Another possible view of this is that DITTO ends up optimizing an upper bound on the standard reward modeling loss:

$$\mathbb{E}_{\pi^{w},\pi^{l}\sim\mathcal{D}^{\pi}}\left[-\log\sigma\left(\mathbb{E}_{y\sim\pi^{w}}[r(x,y)]-\mathbb{E}_{y\sim\pi^{l}}[r(x,y)]\right)\right]\leq\mathbb{E}_{\pi^{w},\pi^{l}\sim\mathcal{D}^{\pi}}\left[-\log\sigma\left(r(x,y^{w})-r(x,y^{l})\right)\right]$$

which arises from applying Jensen’s inequality on the negative log-sigmoid function.
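The bound can be checked exactly on a small discrete example. The sketch below assumes two hypothetical policies over three completions with made-up rewards, and assumes the two completions in a pair are sampled independently; it compares the loss at the expected reward gap with the expected loss over sampled pairs, and also reports how often a sampled pair contradicts the distributional preference.

```python
import math

def neg_log_sigmoid(z):
    # -log(sigmoid(z)), written in a numerically stable form; convex in z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def distributional_loss(pi_w, pi_l, rewards):
    """Loss evaluated at the gap between the two policies' expected rewards."""
    mean_w = sum(p * r for p, r in zip(pi_w, rewards))
    mean_l = sum(p * r for p, r in zip(pi_l, rewards))
    return neg_log_sigmoid(mean_w - mean_l)

def pointwise_loss(pi_w, pi_l, rewards):
    """Expected loss over independently sampled (y_w, y_l) pairs."""
    return sum(pw * pl * neg_log_sigmoid(rw - rl)
               for pw, rw in zip(pi_w, rewards)
               for pl, rl in zip(pi_l, rewards))

# Hypothetical toy setup: pi_w has the higher expected reward.
rewards = [0.0, 1.0, 2.0]
pi_w = [0.1, 0.3, 0.6]
pi_l = [0.6, 0.3, 0.1]

lower = distributional_loss(pi_w, pi_l, rewards)
upper = pointwise_loss(pi_w, pi_l, rewards)
# Probability that a sampled "winner" actually has lower reward than the "loser".
flipped = sum(pw * pl
              for pw, rw in zip(pi_w, rewards)
              for pl, rl in zip(pi_l, rewards) if rw < rl)
print(lower, upper, flipped)
```

The nonzero share of flipped pairs illustrates the caveat in the text: the preference holds in expectation, while individual sampled pairs may violate it.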

A.2 Online Imitation Can Perform Better than SFT

Here we show that, under some circumstances, online imitation learning is theoretically able to perform better than SFT on the expert dataset. To do this, we require a few building blocks.

Proposition A.1.

The objective value $\mathcal{J}_{\text{KL}}$ of any policy $\pi$ can be expressed in terms of the optimal policy $\pi^{*}$ as $\mathcal{J}_{\text{KL}}(\pi)=\mathcal{J}_{\text{KL}}(\pi^{*})-\alpha\mathbb{E}_{x\sim p}\left[D_{\text{KL}}\left(\pi(\cdot|x)\,||\,\pi^{*}(\cdot|x)\right)\right]$.

Proof. Note that at convergence, the optimal policy obeys the equality $\pi^{*}(y|x)=\pi_{\textrm{ref}}(y|x)\,e^{(r(x,y)-V^{*}(x))/\alpha}$. Thus, we can rewrite the reward function in terms of the optimal policy as

$$r(x,y)=\alpha\log\frac{\pi^{*}(y|x)}{\pi_{\textrm{ref}}(y|x)}+V^{*}(x)$$

and substitute it for the reward in the objective function.

\begin{align}
\mathcal{J}(\pi) &= \mathbb{E}_{y\sim\pi(\cdot|x),x\sim p}\left[r(x,y)-\alpha\log\frac{\pi(y|x)}{\pi_{\textrm{ref}}(y|x)}\right] \\
&= \mathbb{E}_{y\sim\pi(\cdot|x),x\sim p}\left[\alpha\log\frac{\pi^{*}(y|x)}{\pi_{\textrm{ref}}(y|x)}+V^{*}(x)-\alpha\log\frac{\pi(y|x)}{\pi_{\textrm{ref}}(y|x)}\right] \\
&= \mathbb{E}_{y\sim\pi(\cdot|x),x\sim p}\left[\alpha\log\frac{\pi^{*}(y|x)}{\pi(y|x)}+V^{*}(x)\right] \\
&= \mathbb{E}_{x\sim p}\left[V^{*}(x)\right]-\mathbb{E}_{y\sim\pi(\cdot|x),x\sim p}\left[\alpha\log\frac{\pi(y|x)}{\pi^{*}(y|x)}\right] \\
&= \mathcal{J}(\pi^{*})-\alpha\mathbb{E}_{x\sim p}\left[D_{\text{KL}}\left(\pi(\cdot|x)\,||\,\pi^{*}(\cdot|x)\right)\right]
\end{align}

This also implies that $\pi^{*}$ is unique (though this is known to be true of MaxEnt RL objectives). This means that provided the reference policy is not already optimal, DITTO is able to improve it.
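Proposition A.1 can be sanity-checked numerically. The following sketch assumes a single prompt (so the expectation over $x$ is trivial) with three completions; all numbers are hypothetical, and the helper names are introduced only for this illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def j_kl(pi, pi_ref, rewards, alpha):
    """J_KL(pi) = E_pi[r] - alpha * KL(pi || pi_ref)."""
    return sum(a * r for a, r in zip(pi, rewards)) - alpha * kl(pi, pi_ref)

def optimal_policy(pi_ref, rewards, alpha):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / alpha)."""
    weights = [p * math.exp(r / alpha) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical toy problem and an arbitrary policy pi to evaluate.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
alpha = 0.5
pi = [0.2, 0.2, 0.6]

pi_star = optimal_policy(pi_ref, rewards, alpha)
lhs = j_kl(pi, pi_ref, rewards, alpha)
rhs = j_kl(pi_star, pi_ref, rewards, alpha) - alpha * kl(pi, pi_star)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, and the gap between $\mathcal{J}_{\text{KL}}(\pi^{*})$ and $\mathcal{J}_{\text{KL}}(\pi)$ is exactly the scaled KL divergence from $\pi$ to $\pi^{*}$.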

Corollary A.2.

Given $\pi_{\textrm{ref}}\neq\pi^{*}$, then $\mathcal{J}(\pi^{*})>\mathcal{J}(\pi_{\textrm{ref}})$.

This follows by considering Proposition A.1 in conjunction with the fact that $\mathcal{J}(\pi^{*})\geq\mathcal{J}(\pi_{\textrm{ref}})$ and that the KL-divergence is zero only if both distributions are equal.
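Spelled out, the step behind Corollary A.2 is Proposition A.1 evaluated at $\pi=\pi_{\textrm{ref}}$. The display below is a sketch that assumes $\alpha>0$ and that the two policies differ on a set of prompts with positive probability under $p$:

```latex
\begin{align}
\mathcal{J}(\pi_{\textrm{ref}})
  &= \mathcal{J}(\pi^{*})
     - \alpha\,\mathbb{E}_{x\sim p}\left[
         D_{\text{KL}}\left(\pi_{\textrm{ref}}(\cdot|x)\,||\,\pi^{*}(\cdot|x)\right)
       \right] \\
  &< \mathcal{J}(\pi^{*}),
\end{align}
```

since the expected KL divergence is strictly positive under these assumptions.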

Lemma A.3.

(Adapted from Theorem 1 of Brown et al. [6]) Let $\pi^{*}$ be the optimal policy for Eq. 1 and $\hat{\pi}$ be the policy estimated by DITTO using expert demonstrations $\mathcal{D}_{E}$. Extrapolation beyond the demonstrator, i.e. $\mathbb{E}_{y\sim\hat{\pi}(\cdot|x),x\sim p}[r(x,y)]>\mathbb{E}_{x,y\sim\mathcal{D}_{E}}[r(x,y)]$, is guaranteed if

$$\mathcal{J}_{\text{KL}}(\pi^{*})-\mathbb{E}_{\mathcal{D}_{E}}[r(x,y)]>\alpha\mathbb{E}_{x\sim p}\left[D_{\text{KL}}\left(\hat{\pi}(\cdot|x)\,||\,\pi^{*}(\cdot|x)\right)\right]-\alpha\mathbb{E}_{x\sim p}\left[D_{\text{KL}}\left(\hat{\pi}(\cdot|x)\,||\,\pi_{\textrm{ref}}(\cdot|x)\right)\right].$$

Proof. This can be shown via a simple sequence of inequalities and an application of Proposition A.1. For brevity, we will omit the expectations over the prompt distribution. We proceed directly.

\begin{align}
\mathbb{E}_{\hat{\pi}}[r(x,y)] &> \mathbb{E}_{\mathcal{D}_{E}}[r(x,y)] \\
\mathcal{J}_{\text{KL}}(\hat{\pi}) &> \mathbb{E}_{\mathcal{D}_{E}}[r(x,y)]-\alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi_{\textrm{ref}}\right) \\
\mathcal{J}_{\text{KL}}(\pi^{*})-\mathcal{J}_{\text{KL}}(\pi^{*})+\mathcal{J}_{\text{KL}}(\hat{\pi}) &> \mathbb{E}_{\mathcal{D}_{E}}[r(x,y)]-\alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi_{\textrm{ref}}\right) \\
\mathcal{J}_{\text{KL}}(\pi^{*})-\alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi^{*}\right) &> \mathbb{E}_{\mathcal{D}_{E}}[r(x,y)]-\alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi_{\textrm{ref}}\right) \\
\mathcal{J}_{\text{KL}}(\pi^{*})-\mathbb{E}_{\mathcal{D}_{E}}[r(x,y)] &> \alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi^{*}\right)-\alpha D_{\text{KL}}\left(\hat{\pi}\,||\,\pi_{\textrm{ref}}\right)
\end{align}

If one wants to directly compare expected rewards, the $-\alpha D_{\text{KL}}\left(\pi^{*} \,\|\, \pi_{\text{ref}}\right)$ term in $\mathcal{J}_{\text{KL}}(\pi^{*})$ can simply be moved to the right-hand side of the inequality. In practice, we choose a fairly small value of $\alpha$. This means that if the objective value of our optimal policy (reward minus KL) is higher than the average reward of the dataset, then we expect to do better than the demonstrator when our learned policy is closer to the optimal one than the reference.
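The step from the third to the fourth inequality above relies on the identity $\mathcal{J}_{\text{KL}}(\hat{\pi}) = \mathcal{J}_{\text{KL}}(\pi^{*}) - \alpha D_{\text{KL}}(\hat{\pi} \,\|\, \pi^{*})$, which holds when $\pi^{*} \propto \pi_{\text{ref}} \exp(r/\alpha)$ is the maximizer of the KL-regularized objective. The following is a minimal numerical sanity check of that identity on a toy discrete problem. The three-outcome distributions, the rewards, and the value of `alpha` are illustrative assumptions, not values from our experiments.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j_kl(policy, ref, rewards, alpha):
    """KL-regularized objective: expected reward minus alpha * KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - alpha * kl(policy, ref)

def optimal_policy(ref, rewards, alpha):
    """Closed-form maximizer of j_kl: proportional to ref * exp(reward / alpha)."""
    weights = [q * math.exp(r / alpha) for q, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Toy setup (hypothetical numbers, chosen only for illustration).
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
alpha = 0.5
pi_hat = [0.2, 0.1, 0.7]

pi_star = optimal_policy(ref, rewards, alpha)
lhs = j_kl(pi_hat, ref, rewards, alpha)
rhs = j_kl(pi_star, ref, rewards, alpha) - alpha * kl(pi_hat, pi_star)
print(abs(lhs - rhs) < 1e-9)
```

The check depends only on the closed form of the optimal policy, so it should hold for any strictly positive reference distribution and any $\alpha > 0$.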

Appendix B Dataset Details

| Source | Authors | Train / Author | Val / Author | Test / Author |
| --- | --- | --- | --- | --- |
| CMCC | 10 | 7 | 2-3 | 2-3 |
| CCAT | 10 | 7 | 3 | 3 |
Table 4: Final Aggregate Benchmark Statistics

In all, we collect data from a total of 20 distinct authors from two sources: (1) CMCC consists of texts written by 21 students in six different genres (email, essay, interview transcript, blog article, chat, or discussion transcript) covering six different controversial topics [17]. We filter this corpus to include only emails and blog posts, excluding sources where multiple individuals were involved (e.g., chat). (2) CCAT [27] consists of articles from the Canadian Broadcasting Corporation’s French Service, sourced from the RCV1-v2 Reuters Corpus dataset. Due to the large number of training paradigms evaluated in this work, we sample articles from 10 authors from each dataset (260 documents total). Table 4 highlights raw counts for each author.

Appendix C Hyperparameters and Training Details

We run a random hyperparameter sweep over a single, randomly selected author from each corpus, using lr $= \{1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6\}$, epoch $= \{10, 15, 20, 25, 30\}$, and $\beta = \{0.01, 0.05, 0.1\}$. We additionally tune how frequently DITTO samples negatives ($K = \{1, 5, 10\}$) and how many negatives DITTO samples ($M = \{1, 5, 10\}$). We fix optimal hyperparameters for each benchmark across all our remaining evaluations. We select hyperparameters from searches conducted on the validation set. All training was conducted on 1 A100 80GB GPU. We use the cosine scheduler for the SFT training step, with a warmup ratio of 0.1; and the constant_with_warmup scheduler for DPO with a warmup ratio of 0.25. For a dataset, we train with SFT until BCE train loss on a given batch approaches 1.00 (early stopping); ideally, we want an LLM to not overfit entirely to demos before the DPO step. Finally, we use AdamW across all experiments.
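As a rough illustration of the sweep described above, the sketch below draws distinct random configurations from the stated grids. The function name `sample_configs`, the dictionary keys, the number of trials, and the seed are assumptions made for this example; it does not reproduce our actual sweep tooling.

```python
import random

# Search space taken from the grids listed above.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-5, 3e-5, 1e-6, 3e-6],
    "epoch": [10, 15, 20, 25, 30],
    "beta": [0.01, 0.05, 0.1],
    "K": [1, 5, 10],  # how frequently negatives are resampled
    "M": [1, 5, 10],  # how many negatives are sampled
}

def sample_configs(n_trials, seed=0):
    """Draw n_trials distinct configurations uniformly at random."""
    total = 1
    for values in SEARCH_SPACE.values():
        total *= len(values)
    if n_trials > total:
        raise ValueError(f"only {total} distinct configurations exist")

    rng = random.Random(seed)
    seen, configs = set(), []
    while len(configs) < n_trials:
        cfg = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        key = tuple(cfg.items())
        if key not in seen:  # skip duplicates so each trial is unique
            seen.add(key)
            configs.append(cfg)
    return configs

configs = sample_configs(n_trials=8)
for cfg in configs:
    print(cfg)
```

Each sampled configuration would then be trained on the selected author and scored on the validation set, as described above.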

| Dataset | CMCC | CCAT |
| --- | --- | --- |
| LoRA Rank | 16 | 16 |
| LoRA Alpha | 32 | 32 |
| SFT Batch Size | 4 | 4 |
| SFT Learning Rate | 3e-5 | 3e-5 |
| DPO Batch Size | ≈ 24 | ≈ 24 |
| DPO Learning Rate | 1e-6 | 1e-6 |
| DPO Grad Steps | 40 | 40 |
| DPO β | 0.05 | 0.05 |
| DITTO Negative Samples | 10 | 10 |
| DITTO Resample Step-Rate | 10 | 10 |
| DITTO Resample Temperature | 1.0 | 1.0 |
| DITTO Frac Replay | 0.2 | 0.2 |
| DITTO Frac Expert | 0.7 | 0.7 |
| DITTO Frac Inter-model | 0.1 | 0.1 |
Table 5: Hyperparameters across benchmark datasets.
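To make the three comparison fractions in Table 5 concrete, the sketch below shows one way a DPO batch of about 24 comparisons could be divided among expert, replay, and inter-model comparisons. The largest-remainder rounding and the function name `split_batch` are assumptions for illustration; the released code may allocate comparisons differently.

```python
import math

def split_batch(batch_size, fractions):
    """Split batch_size into integer counts proportional to fractions,
    using largest-remainder rounding so the counts sum to batch_size."""
    if not math.isclose(sum(fractions.values()), 1.0):
        raise ValueError("fractions must sum to 1")
    raw = {name: frac * batch_size for name, frac in fractions.items()}
    counts = {name: math.floor(value) for name, value in raw.items()}
    leftover = batch_size - sum(counts.values())
    # Hand the remaining slots to the entries with the largest fractional part.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Fractions from Table 5; the batch size of 24 is the approximate DPO batch size.
FRACTIONS = {"expert": 0.7, "replay": 0.2, "inter_model": 0.1}
counts = split_batch(24, FRACTIONS)
print(counts)
```

Under these assumptions, most of each batch consists of comparisons that involve a user demonstration directly, which matches the weight placed on expert comparisons in Table 5.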

Appendix D GPT-eval Prompts

We outline our final evaluation prompt below. We re-prompted for every pair of conditions, swapped generation orders to account for positional bias, and computed an averaged win rate. We sample with temperature = 0.0 for eval, and use GPT-4 0613.

System: You are an impartial evaluator. You are an impartial evaluator. Below is a sample of a human author’s writing and two options.

### HUMAN AUTHOR’S WRITING:
{demo}

### OUTPUT A:
{text_a}

### OUTPUT B:
{text_b}

### Task
Which option was written by the human author based on similarity to the HUMAN AUTHOR’S WRITING above? Respond only with a JSON of the following format:
{
  "answer": "<The option most similar to the HUMAN AUTHOR’S WRITING; either A or B>"
}
ALWAYS REMAIN IMPARTIAL WHEN EVALUATING OUTPUTS.
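The order-swapped evaluation described in this appendix can be sketched as follows. The `judge` argument is a hypothetical callable standing in for a GPT-4 call that fills the prompt above and returns the raw JSON string; `parse_answer`, `win_rate`, and the stub judge `always_a` are names introduced only for this example.

```python
import json
from typing import Callable, Iterable, Tuple

# A judge takes (demo, text_a, text_b) and returns the raw JSON reply.
Judge = Callable[[str, str, str], str]

def parse_answer(raw: str) -> str:
    """Extract 'A' or 'B' from a reply of the form {"answer": "..."}."""
    answer = json.loads(raw)["answer"].strip().upper()
    if answer not in {"A", "B"}:
        raise ValueError(f"unexpected answer: {answer!r}")
    return answer

def win_rate(examples: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Average win rate of `ours` over `baseline`, judging each pair in both orders."""
    wins, total = 0, 0
    for demo, ours, baseline in examples:
        # Order 1: our output is shown as OUTPUT A.
        wins += parse_answer(judge(demo, ours, baseline)) == "A"
        # Order 2: positions swapped, so our output is OUTPUT B.
        wins += parse_answer(judge(demo, baseline, ours)) == "B"
        total += 2
    if total == 0:
        raise ValueError("no examples provided")
    return wins / total

def always_a(demo: str, text_a: str, text_b: str) -> str:
    """Stub judge with maximal positional bias: always prefers OUTPUT A."""
    return json.dumps({"answer": "A"})

print(win_rate([("demo text", "ours", "baseline")], always_a))
```

A judge that always picks the first option scores exactly 0.5 under this scheme, which is the reason for swapping generation orders before averaging.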

Appendix E User Study Details and Example Demonstrations

E.1 User Study Interface

Our interface consists of two parts: a data collection phase where we solicit tasks (Fig. 4) and demonstrations (Fig. 5) from users; and a preference elicitation phase (Fig. 6) where we ask individuals to select between pairwise generations across baselines.

Figure 4: Task Elicitation Screenshot from the User Study. Individuals can either select GPT-4 generated prompts, or write their own.
Figure 5: Demonstration Collection Screenshot. In this step, users provide demonstrations for 2 of the 4 selected prompts. We give GPT-4 zero-shot completions so that users can edit or rewrite to their liking. In the screenshot above, we show the GPT-4 completion before edits.
Figure 6: Pairwise Preference Screenshot. In the final step, users are asked to rank 20 pairwise preferences comparing all training paradigms.

E.2 User Study Task Generation Prompts

While users can provide their own tasks as part of the data collection phase, we also generate prompts to make the process easier. Below, we outline the prompt for task generation. We replace {domain} and {audience} based on the input from the user.

Please generate {n_tasks} ChatGPT writing prompts for the domain of "{domain}" and an audience of "{audience}".

EXAMPLES:
Here are some examples of writing prompts based on a domain of "Writing email" and an audience of "personal friends":
Write an email to a friend inviting them to dinner.
Write an email to a friend asking for advice.
Write an email to a friend sharing a recent accomplishment.

Like our examples, make sure the new tasks you generate will result in writing with the same style or values (e.g. all casual, all formal, all honest, etc.)

Respond with ONLY a JSON with the following format:
{
  "tasks": [
    "<Task 1>",
    "<Task 2>",
    "<Task 3>",
    ...
  ]
}
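A minimal sketch of how the placeholders in this prompt might be filled and the JSON reply parsed is shown below. Only the first sentence of the prompt is reproduced in `TEMPLATE` for brevity; the helper names `build_prompt` and `parse_tasks` and the sample reply are assumptions for illustration, and the call to the model itself is omitted.

```python
import json

# Abbreviated template: only the first sentence of the prompt above.
TEMPLATE = (
    'Please generate {n_tasks} ChatGPT writing prompts for the domain of '
    '"{domain}" and an audience of "{audience}".'
)

def build_prompt(n_tasks: int, domain: str, audience: str) -> str:
    """Fill the {n_tasks}, {domain}, and {audience} placeholders."""
    return TEMPLATE.format(n_tasks=n_tasks, domain=domain, audience=audience)

def parse_tasks(raw: str) -> list:
    """Parse a reply of the form {"tasks": ["...", ...]} into a list of strings."""
    tasks = json.loads(raw)["tasks"]
    if not isinstance(tasks, list) or not all(isinstance(t, str) for t in tasks):
        raise ValueError("expected a JSON list of strings under 'tasks'")
    return [t.strip() for t in tasks if t.strip()]

prompt = build_prompt(3, "Writing email", "personal friends")
# Hypothetical model reply, used only to exercise the parser.
sample_reply = '{"tasks": ["Write an email to a friend inviting them to dinner."]}'
print(prompt)
print(parse_tasks(sample_reply))
```

If the full prompt were stored as a single format string, the literal braces of the JSON example would need to be doubled so that `str.format` does not treat them as placeholders.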

E.3 Examples of Demonstrated Feedback

We include some examples of demonstrated feedback to highlight the types of changes users make.

E.3.1 User A

Task Prompt

Write an email to a friend discussing a recent movie or TV show you watched.

Original GPT-4 Generation

Subject: Let’s dish about the latest movie we watched!

Hey [Friend’s Name],

I hope you’re doing well! I wanted to chat with you about the movie I just watched last night - "Inception". Have you seen it yet? It’s a mind-bending thriller that kept me on the edge of my seat the whole time.

The storyline is so unique and complex, with layers of dreams within dreams that kept me guessing until the very end. The visuals and special effects were absolutely stunning, and the acting was top-notch. Leonardo DiCaprio was fantastic in his role, as always.

I don’t want to give too much away in case you haven’t seen it yet, but I highly recommend it if you’re into thought-provoking films that make you question reality. Let me know if you’ve seen it and what your thoughts are - I’d love to hear your take on it!

Looking forward to catching up with you soon.

Take care, [Your Name]

User’s Demonstration

Subject: WTF even happens in Tenet?

I know you like Inception, but your boy Nolan’s newest movie is Tenet and OMG I have no idea what happens. Like they are traveling through time, but not like normal time travel, they just move backwards through time. Like everything happens as if a movie were played in reverse. So, like, when people who are moving forward in time (the way we do in real life) see the people who are moving backward in time they actually look like they’re moving in reverse. And then kenneth branagh is like somehow communicating with the future and building a weapon to kill everyone for unknown reasons. I have watched it three times and I literally still have no idea what happens. But the special effects are so dope I just keep rewatching it, lol. N-E-WAYZ watch it and tell me what you think mmmm-kay?

E.3.2 User B

Task Prompt

Email a faculty member requesting guidance on preparing an effective research mentorship application.

Original GPT-4 Generation

Subject: Request for Guidance on Research Mentorship Application Preparation

Dear [Faculty Member’s Name],

I hope this email finds you well. I am reaching out to seek your guidance and advice on preparing an effective research mentorship application.

I am interested in applying for a research mentorship opportunity [specific program or opportunity if applicable] and I believe that your expertise and experience could greatly enhance my application. I have a strong interest in [briefly mention your research interests or specific area of study] and I am excited about the prospect of working with you to further develop my research skills.

I would greatly appreciate it if you could provide me with insights on what makes a successful research mentorship application, including key elements to focus on, potential strategies to highlight my strengths, and any other advice you may have based on your experience in this field.

I am committed to dedicating the necessary time and effort to submit a competitive application and value your guidance in this process. Your support and mentorship would be invaluable to me as I strive to further my research goals and academic pursuits.

Thank you in advance for considering my request. I look forward to your advice and suggestions to help me craft a strong research mentorship application.

Warm regards,

[Your Name] [Your Contact Information]

User’s Demonstration

Subject: Request for Guidance on Research Mentorship Application Preparation

Dear [Faculty Member’s Name],

My name is [my name], a PhD student at [university name]. I am applying to [name of this faculty member’s mentorship program] for Summer 2024. I am excited about the possibility of participating in your institute’s program, and I am reaching out with a question about faculty participating in your program.

My research interests and prior experience span several areas: [briefly mention your research interests in a concise list or sentence]. However, these areas are relevant to several faculty within your institute – [list 2-3 names] – some of whom may not be advising students this summer through your program.

If possible, could you please let me know if any of these faculty are participating in your program in summer? I would love to apply if any of these faculty are accepting students.

Thank you!

[Your Name] [Your Contact Information]

E.4 Demonstrations for Sample Efficiency Task

Task Prompt

Write an email informing lab mates that we will be having ice cream this weekend as a lab social.

Demonstration #1

We are gonna get some EYE SCREAM this weekend at [place] for our social. It’s getting really friggin hot. Plus, you know, me and ice cream…

Whenever you get time: can you reply to me ASAP so I can have a good idea of what the count looks like? I’ll send some more details in a bit re time.

See ya’ll there!

[Name]

Demonstration #2

ATTENTION!!! VERY URGENT!!

Ice cream this weekend at [place]. We haven’t had a social in a bit; plus [person] is gonna join us too.

Lemme know if [time] works for you all! If not, we can figure something else out.

Be there or be a melted ice cream cone,

[Name]

 

Task Prompt

Write an email informing students that there will be no seminar next week.

Demonstration #1

Hey folks!

We won’t be having a seminar this week. Let me know if you have any questions for next week, though!

[Name]

Demonstration #2

Hi everyone!

Just a reminder that there won’t be a seminar this week. See you next week! As always, feel free to reach out if you have any questions about the seminar in general.

[Name]

Datasets: CMCC, CCAT50, User Study

Prompt

CMCC: Write an approximately 500 word essay to the following prompt: The city of Denver has decided to legalize small amounts of marijuana for persons over 21. How do you feel about this?

CCAT50: Write an article that starts with the following: Two of Japan’s most troubled banks took action on Tuesday to end their long-running financial problems, with one finding a merger partner and the other finalising a radical restructuring.

User Study: Write an email to a friend sharing a new recipe you tried and loved.

Author-written

CMCC: Marijuana is one of those things that’s still stigmatized for few legitimate reasons, and most of the reasons people feel a certain way about it are also true of alcohol. One huge, bone-crushingly important, monumental difference between the two is this: the government can and does effectively tax the manufacture and distribution of alcohol (and tobacco, for that matter). Denver has taken a step in the right direction, and maybe the only step that needs to be taken. Since people are given permission to carry only small quantities, theoretically this still legislates against large-scale wheeling and dealing while preventing cops from having to divert time and attention to minor drug busts, which are often a waste of resources […]

CCAT50: […] The moves are seen as just the beginning of a drive by the the indebted banking industry to get ready for Japan’s emerging "Big Bang" era of financial deregulation. Hokkaido Takushoku Bank, Japan’s 10th-largest commercial bank, said it will merge with the smaller Hokkaido Bank Ltd and cease its overseas operations to become a "super-regional" bank. Such super-regionals use their local power base to build a large presence that rivals their big-city competitors. Meanwhile, Nippon Credit Bank Ltd, one of three long-term credit banks in Japan, was expected to hold a news conference later in the day on a restructuring that would include an injection of government money and bankruptcy filings […]

User Study: Note: samples from the user study have different train prompts. We have no ground truth for the test prompts.

Write an email to a friend planning a virtual movie night.

Guess what we are doing on Saturday!?!? Having a virtual movie night!!!

Here are the details:
Date: [Date] […]

I will send you the link later. Tell Graham to come too. I’ve invited Bob and Sarah so I think we’ll have a good crew. Do not bail!! […]

DITTO

CMCC: The Denver City Council recently voted to approve the legalization of small amounts of marijuana for residents 21 and over (The Denver Post). I absolutely support this decision. This is a move towards common sense law. It saves the police significant amounts of time and resources. I’d argue that the enforcement of all laws concerning drugs should take this angle. It’s also beneficial for other reasons (revenue to the city, taxpayer dollars spent more effectively, etc). Opponents frequently argue for the “children being present in the streets” theory. Fine, add a proviso […]

CCAT50: […] Under that plan, the government vowed to fully privatise Japan’s troubled financial institutions by the 1998 business year. "The government will help with the mergers and will provide a stable environment in which the institutions can take steps toward becoming self-sustaining," Finance Minister Ryutaro Hashimoto told reporters after a meeting on Tuesday with executives of the two troubled institutions. The two had already taken the first necessary steps toward their "dying business" scenarios, in which they have almost no chance of standing on their own […]

User Study: Hey friend! Hope you’re doing well. This weekend I tried out this new vegan recipe for lentil bolognese (lentil bolognesa?) and it was amazing!! If you try it, the sauce is actually made with lentils instead of ground meat. You have to cook it down for like 45 minutes (you can even make it the night before and let it sit) — it’s the most hands off thing, and it thickens the longer it simmers. Then just toss it with pasta, garlic bread and you’re good to go. Lentils are the best for filling in the place of ground meat, and this felt so satisfying as a meal. So I highly recommend, it’s on all the vegan cooking instagrams but there’s a million ways to make it […]

Few-shot GPT-4

CMCC: I am certainly intrigued by the city of Denver’s recent decision to decriminalize small quantities of marijuana for individuals who have reached the legal drinking age. Given the myriad of implications this policy shift has for government, society, and individuals alike, it is certainly a development worthy of thoughtful examination. […]

CCAT50: […] In what is seen as a significant development in the banking industry, stricken Sumitomo Trust & Banking firm announced a merger deal with established Sakura Bank Ltd. Meanwhile, Tokai Bank Ltd, grappling with similar issues, unveiled a comprehensive restructuring plan aiming to nurse the bank back to health over the next few years […]

User Study: Subject: YOU NEED TO TRY THIS RECIPE!!! OMG! You have GOT to hear about this bomb-diggity recipe I just discovered! It’s a pasta dish from Italy’s homemade spaghetti aglio e olio!! YES, it sounds fancy but trust me it’s super simple, idiot-proof, and the finished product, it’s lip-smackingly DELISH! The main stuff you need are spaghetti, garlic, olive oil, chilli flakes (for the kick), and EVERYONE’s fav cheese Parmesan […]

SFT

CMCC: The Mile High City has made headlines once again as the result of an ordinance passed earlier this year. Denver, Colorado’s main metro area, has made history by legalizing small amounts of marijuana for persons over the age of 21. The Denver Police Department now cannot levy charges against those found in possession of the substance in sizes considered “reasonable”; the law, however, bans the sale and consumption of marijuana in any publicly accessible vicinity. […]

CCAT50: […] Yamatane Seimei First Real Estate Co Ltd will be the first major institution to be wound up in Japan since World War Two, when Japan’s wartime debt was liquidated, the firm said in a statement. The firm said it will request on Wednesday that the government order it into a special public liquidation scheme – which is supposed to be a last resort for rescued financial institutions. […]

User Study: Hey friend, So last night I had some people over for dinner, and I cooked this amazing chicken recipe. I’m not gonna lie, it took a little bit of work, but it was totally worth it. It was a skillet chicken marsala, and it tasted ah-mazing. I cooked the chicken in butter until it was golden brown on the outside, and then I took it out and cooked some shallots and mushrooms in the pan. Then I added some marsala wine, chicken stock, and cream, and let it simmer for about 20 minutes. At the very last, I added the chicken back to the pan to cook through while the sauce reduced.
Table 6: Selected prompts and responses across datasets (CMCC, CCAT50, User Study) and methods (DITTO, SFT, Few-shot GPT-4) for all evaluated models. Compared to the author-written ground truth, we observe that SFT occasionally feels longwinded, or fails to capture quirks of the author’s writing. Meanwhile, few-shot GPT prompting yields outputs that sound "GPT-like" (myriad of implications, significant development, etc.), or over-does the author’s original style (i.e. is cringe). Qualitatively, DITTO generations best preserve the author’s voice and style.