License: CC BY 4.0
arXiv:2305.16635v3 [cs.CL] 05 Apr 2024

Impossible Distillation for Paraphrasing and Summarization:
How to Make High-quality Lemonade out of Small, Low-quality Models

Jaehun Jung†     Peter West†     Liwei Jiang†     Faeze Brahman†‡
Ximing Lu†     Jillian Fisher†     Taylor Sorensen†     Ye** Choi†‡
†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
[email protected]
Abstract

We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes even ChatGPT itself. We also find that our dataset distilled from 1.5B LMs exhibits higher diversity and fidelity than datasets up to 13 times larger.

1 Introduction

Training a compact, yet performant model is a non-trivial challenge in modern NLP, even for classical tasks such as paraphrase generation and sentence summarization. While large-scale, high-quality data is central to this goal, human supervision is hard to scale; as such, research efforts have focused on training models with an unsupervised, automatically generated dataset. Common approaches include back-translation Wieting and Gimpel (2018) and auto-encoding Févry and Phang (2018), but the resulting corpora are often noisy and limited in diversity Hu et al. (2019a, b).

Alternatively, recent works propose to train a compact task model by distilling knowledge from large language models (LLMs) West et al. (2022). As LLMs such as GPT3 – often multi-billion scale and instruction-tuned – are already competent in paraphrasing and summarizing sentences Cegin et al. (2023), a specialized model can be trained by simply imitating LLM generations Cegin et al. (2023); Xu et al. (2023). Despite its limitations (e.g., the significant budget required for data collection), LLM distillation outperforms previous methods without human supervision, giving the impression that a powerful teacher LM is all we need to train a better student.

Figure 1: Impossible Distillation develops upon paraphrastic proximity: LM’s tendency to encode paraphrases on a proximal subspace in its distribution.
Figure 2: Overview of Impossible Distillation. Starting from a low-quality LM (GPT2), we generate a data pool of input-output pairs leveraging paraphrastic proximity, filter it with off-the-shelf critics, and distill a student model on this data pool. By self-distilling the student model, we obtain a high-quality dataset and model for the target task.

In this work, we envision a seemingly impossible alternative to LLM distillation: instead of an extreme-scale, frontier LLM (e.g., GPT3), can we start off with a small, off-the-shelf LM that itself cannot perform paraphrase generation or sentence summarization? We present Impossible Distillation, a novel framework to distill a task-specialized dataset and model from GPT2-scale LMs. Our framework requires neither a strong LLM nor human-authored references, yet can distill high-quality paraphrases and summaries comparable to those obtained by prompting the strongest LLMs.

The key observation behind our framework is that a sentence and its paraphrases tend to lie on a proximal subspace in the pretrained LM distribution – a property we call paraphrastic proximity (Fig. 1). In other words, by effectively narrowing the LM search space (e.g., by constraining the model with an informative context) toward the paraphrastic subspaces, we can encourage the model to generate multiple sequences that paraphrase each other. As shown in Fig. 2, we leverage this property by first constructing a data pool of (source, paraphrase) pairs by enumerating a batch of generations sampled given the context. Next, we filter the data pool with off-the-shelf critic models to keep only the pairs with high-quality paraphrases, which we subsequently use to fine-tune a student LM. Finally, the student LM is further refined through self-distillation, where the model is trained on its own high-quality paraphrases; as a result, we obtain both a high-quality corpus and a compact, yet powerful model for paraphrasing. Moreover, as Impossible Distillation is grounded on the explicit evaluation of generated pairs, the framework generalizes to sentence summarization by simply re-defining the filters.

Experimental results show that Impossible Distillation is surprisingly effective, both in terms of the distilled data quality and model performance. We first evaluate the quality of our dataset by measuring the semantic fidelity, lexical diversity, and syntactic diversity against three state-of-the-art paraphrase corpora. We find that our dataset, as a purely synthetic corpus generated from 1.5B LMs, scores better on all measures than state-of-the-art datasets: ParaBank Hu et al. (2019a), which is 13 times larger than ours, and ChatGPT-Para Vorobev et al. (2023), which is generated by the orders-of-magnitude larger ChatGPT. Furthermore, in benchmarks across three distinct tasks – unconstrained / syntax-controlled paraphrasing and sentence summarization – our model distilled from a 1.5B LM outperforms competitive baselines, including both the task-specific methods and the models distilled from ChatGPT OpenAI (2022). In human evaluation, our model with 770M parameters is consistently preferred to the ChatGPT-distilled model, and sometimes even to ChatGPT itself.

2 Paraphrastic Proximity

We develop Impossible Distillation based on the observation of paraphrastic proximity, i.e., when the LM decoding space is constrained with sufficiently informative context, the model can produce multiple generations that paraphrase each other. Notably, Meng et al. (2021) indirectly leverage paraphrastic proximity in context LMs – a set of encoder-decoder transformers pre-trained from scratch using specialized training objectives. We, on the other hand, show that paraphrastic proximity holds for off-the-shelf LMs such as GPT2, which we can make use of to distill a high-quality task model and dataset. In this section, we first verify this with GPT2-XL Radford et al. (2019).

While the exact distribution of all possible generations is intractable, we can obtain an approximation by sampling a large number of generations given a contextual constraint. Concretely, we first sample 1000 contexts (each with 1-5 sentences) from news articles in the XSUM dataset Narayan et al. (2018). Then we prompt GPT2 to generate 100 next sentences for each contextual constraint. To evaluate whether these sentences are indeed paraphrastic to each other, we measure their (1) pair-wise semantic equivalence, and (2) surface-form dissimilarity. For semantic equivalence, we employ an off-the-shelf NLI model Liu et al. (2022a), and determine a pair $(x, y)$ to be semantically equivalent if entailment holds in both directions. For surface-form dissimilarity, we compute the Self-BLEU Zhu et al. (2018) between sentences, for only the semantically equivalent pairs.

The results are presented in Fig. 3. In the left figure, as the context becomes longer (i.e. more informative), the generated sentences are more likely to be semantically equivalent, verifying our assumption. Notably, the high semantic equivalence does not come from merely generating sentences with similar surface form – even with longer context, the average pair-wise Self-BLEU is around 32 (for reference, the average Self-BLEU of human-authored paraphrases in the MRPC Dolan and Brockett (2005) dataset is 39 Herbold (2023)). The results indicate that GPT2 can generate a large number of paraphrases simply by over-sampling multiple completions to the given context.

Figure 3: How paraphrastic are GPT2-XL generations? We compute the ratio of semantically equivalent pairs and their average Self-BLEU.

On the right side of Fig. 3, we also gauge the paraphrastic proximity under various decoding temperatures. Here, the ratio of semantically equivalent pairs dramatically increases as the temperature decreases. Low temperature skews the sampling distribution towards regions with high probability mass, hence allowing paraphrases to be more easily sampled from these subspaces. However, it is important that the temperature balances the trade-off between sample efficiency and diversity; when the temperature is too low, the generated sentences are almost identical and hence do not qualify as desirable paraphrases.
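
The effect of temperature on the sampling distribution can be illustrated with a small sketch. The logits below are hypothetical and are not taken from GPT2, and the function name is ours. Dividing the logits by a temperature below 1 before the softmax concentrates probability mass on the most likely continuations:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits (illustration only, not GPT2 outputs).
logits = [2.0, 1.0, 0.5, 0.0]
for temperature in (1.0, 0.7, 0.3):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, the top entry absorbs most of the mass, which mirrors the trade-off described above: samples become more alike (higher equivalence ratio) but also less diverse.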

3 Impossible Distillation

Impossible Distillation starts from an off-the-shelf teacher LM $\mathcal{M}_{\mathit{T}}$, and distills its knowledge into a student LM $\mathcal{M}_{\mathit{S}}$, yielding a specialized model $\mathcal{M}_{\textit{task}}$ for paraphrasing and sentence summarization. As a byproduct of this process, we also obtain a high-quality dataset $\mathcal{D}_{\textit{task}}$. Below, we detail the process focusing on paraphrase generation as the task of interest, then discuss how this generalizes to sentence summarization.

3.1 Pair Generation

We first generate a large pool of candidate (source-paraphrase) pairs $\mathcal{C}_{\mathit{T}} = \{(x_1, y_1), \cdots, (x_{|\mathcal{C}_{\mathit{T}}|}, y_{|\mathcal{C}_{\mathit{T}}|})\}$ from an off-the-shelf teacher $\mathcal{M}_{\mathit{T}}$. Our first step is to prepare contextual constraints $c_i$, by sampling 1-5 sentences from $\mathcal{M}_{\mathit{T}}$:

\[
c_i \sim p_{\mathcal{M}_{\mathit{T}}}(\cdot)
\]

The contextual constraints could be generated either unconditionally or conditioned on a simple prompt (Appendix A). Alternatively, one could sample contextual constraints from a human-written corpus (as done in §2). While manually collecting contextual constraints allows fine-grained control over the generation style and domain, we show that LM-generated context suffices to yield a highly diverse and domain-specific data pool without resorting to an external source of data (§4.1).

Next, we generate a batch of next sentences conditioned on each $c_i$, then enumerate candidate pairs as the combinations of these sentences:

\[
\begin{split}
&\{s_{i1}, \cdots, s_{ik}\} \sim p_{\mathcal{M}_{\mathit{T}}}(\cdot \mid c_i; \tau_{\textit{temp}}) \\
&\mathcal{C}_i = \{(s_{im}, s_{in}) \mid m, n \in [1, k],\ m \neq n\}
\end{split}
\]

Concretely, we set $k = 100$, generating 100 samples per $c_i$ using Nucleus-Sampling Holtzman et al. (2020). Based on our preliminary experiments in §2, we set the decoding temperature $\tau_{\textit{temp}} = 0.7$ to balance the diversity and sample efficiency of the generated pairs. Collecting the pairs across all $c_i$, we obtain the data pool $\mathcal{C}_{\mathit{T}} = \bigcup_i \mathcal{C}_i$.
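
The enumeration step can be sketched as follows. This is a minimal illustration that assumes the sampled sentences are already available as lists of strings; the function names are ours. Because the definition of the per-context pool ranges over all ordered index pairs with m ≠ n, each context with k = 100 samples contributes k(k - 1) = 9,900 candidate pairs:

```python
from itertools import permutations


def enumerate_pairs(sentences):
    """All ordered pairs (s_m, s_n) with m != n from one context's samples."""
    return list(permutations(sentences, 2))


def build_pool(generations_per_context):
    """Union of the per-context pair sets over all contexts."""
    pool = []
    for sentences in generations_per_context:
        pool.extend(enumerate_pairs(sentences))
    return pool


# Placeholder strings standing in for 100 sampled next sentences.
samples = [f"sentence {j}" for j in range(100)]
print(len(enumerate_pairs(samples)))
```

The quadratic growth in pairs is what makes over-sampling attractive, and also why the pool is noisy before filtering.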

3.2 Filtering with Critics

Despite producing a large population of valid paraphrases, our pair generation process is noisy in nature, as it enumerates all possible pairs of generated sentences. For example, Generation 1 (I wish to visit there sometime.) and Generation N (I hate a long flight.) in Fig. 2 will constitute a pair in the data pool, although the two sentences have no logical relevance. A crucial step, therefore, is to filter out suboptimal pairs from the data pool and ensure the quality of the distilled dataset.

Semantic Equivalence Filter

A faithful paraphrase should preserve the semantics of the source statement without hallucinating unsupported content. NLI models are well-suited to quantify this relationship, as they are trained to infer the logical entailment between an arbitrary pair of statements Chen et al. (2021). Hence, we define a binary filter using a small NLI model Liu et al. (2022a) as a critic, and discard the pairs whose entailment score falls below the threshold $\tau_{\textit{semantic}}$ in either direction:

\[
\begin{split}
f_{\textit{semantic}}(x, y) = \mathbbm{1}\Bigl\{ & p_{\textit{NLI}}(x \Rightarrow y) \geq \tau_{\textit{semantic}} \,\,\land \\
& p_{\textit{NLI}}(y \Rightarrow x) \geq \tau_{\textit{semantic}} \Bigr\}
\end{split}
\]
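
A minimal sketch of this bidirectional check is given below. The NLI critic is replaced by a toy lookup table with hand-written scores, and the threshold value is hypothetical (the actual threshold is deferred to Appendix A); only the structure of the filter follows the definition above.

```python
from typing import Callable


def f_semantic(x: str, y: str,
               p_nli: Callable[[str, str], float],
               tau_semantic: float) -> bool:
    """Keep (x, y) only if entailment holds in both directions."""
    return p_nli(x, y) >= tau_semantic and p_nli(y, x) >= tau_semantic


# Toy stand-in for an NLI critic: hand-written scores, NOT real model outputs.
TOY_SCORES = {
    ("A dog runs in the park.", "An animal is outside."): 0.97,
    ("An animal is outside.", "A dog runs in the park."): 0.08,
    ("He bought a car.", "He purchased an automobile."): 0.98,
    ("He purchased an automobile.", "He bought a car."): 0.97,
}


def toy_p_nli(premise: str, hypothesis: str) -> float:
    return TOY_SCORES.get((premise, hypothesis), 0.0)


TAU = 0.9  # hypothetical threshold, chosen for illustration only
print(f_semantic("A dog runs in the park.", "An animal is outside.", toy_p_nli, TAU))
print(f_semantic("He bought a car.", "He purchased an automobile.", toy_p_nli, TAU))
```

The first pair shows why one direction is not enough: the source entails the candidate, but the candidate drops information, so the reverse direction fails and the pair is discarded.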

Dissimilarity Filter

A good paraphrase should significantly alter the surface form of the input while preserving its meaning. In Impossible Distillation, the surface-form dissimilarity is achieved by filtering pairs based on (1) the token overlap between sentences and (2) their syntactic difference. For token overlap, we filter out the pairs whose ROUGE-L Lin (2004) exceeds a threshold $\tau_{\textit{rouge}}$. To gauge the syntactic difference, we follow prior works Kumar et al. (2020) by first parsing the constituency tree of the source and paraphrase, then filtering based on their tree edit distance (TED):

\[
\begin{split}
f_{\textit{dissim}}(x, y) = \mathbbm{1}\Bigl\{ & \text{ROUGE}(x, y) \leq \tau_{\textit{rouge}} \,\,\land \\
& \text{TED}(x, y) \geq \tau_{\textit{TED}} \,\, \Bigr\}
\end{split}
\]

Intuitively, the two dimensions of dissimilarity complement each other – while the ROUGE filter promotes lexical divergence in each pair, the TED filter preempts “hacking” the token-overlap metric by simply switching a few words in the source sentence with corresponding synonyms.
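
The interplay of the two conditions can be sketched as follows. This is an illustration under stated assumptions: ROUGE-L is computed as the F1 of the longest common subsequence over lowercased whitespace tokens, the tree edit distance is passed in as a precomputed number (no parser is invoked here), and both thresholds are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(x, y):
    """LCS-based F1 over whitespace tokens (a simplified ROUGE-L)."""
    a, b = x.lower().split(), y.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def f_dissim(x, y, ted, tau_rouge, tau_ted):
    """Keep a pair only if it is lexically AND syntactically different."""
    return rouge_l_f1(x, y) <= tau_rouge and ted >= tau_ted


src = "the movie was really good"
syn = "the film was truly great"  # synonym swap, same syntactic frame
print(rouge_l_f1(src, syn))
# Assumed TED of 0 for identical tree shapes; thresholds are hypothetical.
print(f_dissim(src, syn, ted=0, tau_rouge=0.5, tau_ted=1))
```

In this toy case the synonym swap lowers the token overlap enough to pass the ROUGE condition, yet the pair is still rejected because the syntactic structure is unchanged.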

Diversity Filter

Constructing a high-quality corpus is not just about creating valid input-output pairs; ideally, the corpus should cover a diverse range of styles and topics within its samples, as the data diversity directly correlates with the robustness of the trained model Rebuffi et al. (2021). Our data pool might be limited in this regard, as it includes a large number of pairs from the same context $c$, often resulting in multiple pairs having similar $x$ or $y$. To remove the duplicate pairs and promote diversity, we employ an additional critic $f_{\textit{diversity}}$. Concretely, we define two pairs $(x_1, y_1)$ and $(x_2, y_2)$ to be duplicates when one pair entails another, either on the input side ($x_1 \Rightarrow x_2$) or the output side ($y_1 \Rightarrow y_2$). The diversity filter operates by first grouping all entailing pairs, then discarding all but the one with the largest entailment score. In practice, this filter can be efficiently implemented using graph traversal; we describe the formal algorithm in Appendix A.
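
One way to read the graph-traversal description is sketched below. This is not the algorithm from Appendix A, which is not reproduced here; it is an assumption-laden illustration in which pairs are nodes, an edge links two pairs whenever their inputs or outputs entail each other in either direction, connected components are found by breadth-first search, and the highest-scoring pair in each component is kept. The `entails` predicate and the per-pair `scores` are hypothetical inputs.

```python
from collections import deque


def diversity_filter(pairs, scores, entails):
    """Keep one pair (the highest-scoring) per group of mutually linked pairs."""
    n = len(pairs)
    adjacency = [[] for _ in range(n)]
    # Naive O(n^2) edge construction, fine for a sketch.
    for i in range(n):
        for j in range(i + 1, n):
            (x1, y1), (x2, y2) = pairs[i], pairs[j]
            if (entails(x1, x2) or entails(x2, x1)
                    or entails(y1, y2) or entails(y2, y1)):
                adjacency[i].append(j)
                adjacency[j].append(i)

    seen = [False] * n
    kept = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        queue, group = deque([start]), []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in adjacency[node]:
                if not seen[neighbor]:
                    seen[neighbor] = True
                    queue.append(neighbor)
        best = max(group, key=lambda idx: scores[idx])
        kept.append(pairs[best])
    return kept


# Toy entailment predicate: exact match after lowercasing (illustration only).
toy_entails = lambda premise, hypothesis: premise.lower() == hypothesis.lower()
pool = [("I like tea.", "Tea is my favorite."),
        ("I like tea.", "I enjoy drinking tea."),
        ("It rained.", "There was rain.")]
print(diversity_filter(pool, scores=[0.90, 0.95, 0.80], entails=toy_entails))
```

The first two pairs share a source sentence, so they fall into one group and only the higher-scoring one survives.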

Incorporating all critics, we filter the candidate pool $\mathcal{C}_{\mathit{T}}$ into a refined dataset $\mathcal{D}_{\mathit{T}}$ as follows:

\[
\begin{split}
\mathcal{D}_{\mathit{T}} = \{(x, y) \mid\ & (x, y) \in \mathcal{C}_{\mathit{T}}, \\
& f_{\textit{semantic}} \land f_{\textit{dissim}} \land f_{\textit{diversity}}(x, y) = 1\}
\end{split}
\]

3.3 Distilling Student Model

Now that we have extracted the paraphrastic knowledge of the teacher $\mathcal{M}_{\mathit{T}}$ into a dataset $\mathcal{D}_{\mathit{T}}$, we use the data to fine-tune the student model into a paraphrase generation model. The student model $\mathcal{M}_{\mathit{S}}$ is fine-tuned by maximizing $\mathbbm{E}_{(x, y) \sim \mathcal{D}_{\mathit{T}}}[\log p_{\mathcal{M}_{\mathit{S}}}(y \mid x)]$, i.e. the conditional log-likelihood of $y$ given $x$.

Next, the paraphrasing capability of the student is further amplified through self-distillation, by fine-tuning on its own generated high-quality paraphrases. We first sample the input sentence $x$ from the teacher LM $\mathcal{M}_{\mathit{T}}$, then generate paraphrase $y$ by feeding $x$ into $\mathcal{M}_{\mathit{S}}$:

\[
\mathcal{C}_{\mathit{S}} = \{(x_1, y_1), \cdots \mid x_i \sim p_{\mathcal{M}_{\mathit{T}}}(\cdot \mid c_i);\ y_i \sim p_{\mathcal{M}_{\mathit{S}}}(\cdot \mid x_i)\}
\]

Using the same critics as in the previous stage, we filter $\mathcal{C}_{\mathit{S}}$ to obtain a high-quality dataset $\mathcal{D}_{\textit{para}}$. Finally, we fine-tune $\mathcal{M}_{\mathit{S}}$ on $\mathcal{D}_{\textit{para}}$, yielding the end-stage model $\mathcal{M}_{\textit{para}}$. Consistent with prior findings on self-distillation Pham et al. (2022); Allen-Zhu and Li (2020), this simple process significantly improves the performance of our task model, as confirmed by our ablation study (§4.5). In addition, our self-distillation outputs a large-scale, standalone dataset $\mathcal{D}_{\textit{para}}$ that can be evaluated and reused, e.g., to directly train a paraphrasing model without re-iterating the distillation procedure.
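
The control flow of the self-distillation stage can be summarized in a short sketch. Everything here is a placeholder: `student_generate`, `pair_critics`, and `fine_tune` are hypothetical callables standing in for the student's decoding, the per-pair critics, and a training routine, and the set-level diversity filter is omitted for brevity.

```python
def self_distillation_round(sources, student_generate, pair_critics, fine_tune):
    """Generate with the student, filter with critics, then fine-tune on the result."""
    # Candidate pool: each teacher-sampled source x is paired with the student's output y.
    candidates = [(x, student_generate(x)) for x in sources]
    # Keep only the pairs accepted by every per-pair critic.
    d_para = [(x, y) for (x, y) in candidates
              if all(critic(x, y) for critic in pair_critics)]
    # Fine-tune on the filtered pairs; the dataset is returned as a reusable artifact.
    return fine_tune(d_para), d_para


# Toy stand-ins so the sketch runs end to end (not real models).
toy_generate = lambda x: x if x == "s2" else x + " (rewritten)"
not_a_copy = lambda x, y: x != y
toy_fine_tune = lambda data: f"student fine-tuned on {len(data)} pairs"

model, dataset = self_distillation_round(["s1", "s2"], toy_generate,
                                         [not_a_copy], toy_fine_tune)
print(model)
print(dataset)
```

Returning the filtered dataset alongside the model reflects the point made above: the corpus is a standalone output that can be inspected and reused without rerunning the pipeline.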

Dataset (# Instances)   Semantic Similarity   Lexical Diversity                              Syntactic Diversity
                        Cosine Sim. ↑         $H_2$ ↑   $H_3$ ↑   MSTTR ↑   Jaccard Sim. ↓   TED-3 ↑   TED-F ↑
ParaBank1 (57.0M)       81.77                 17.07     21.66     45.52     48.41            3.59      14.53
ParaBank2 (19.7M)       82.50                 17.48     21.44     46.16     43.44            4.04      17.41
ChatGPT-Para (2.1M)     85.44                 17.67     21.41     35.83     44.56            4.26      20.15
DIMPLE (4.2M)           87.68                 17.75     22.46     53.08     43.62            5.02      29.84
Table 1: Quality comparison between paraphrase datasets. DIMPLE, as a purely synthetic corpus generated from 1.5B LMs, exhibits better diversity compared to others, including the dataset constructed by prompting ChatGPT.

3.4 Endowing Controllability

Recent works emphasize the importance of syntactic control in paraphrase generation, allowing the model to generate an output paraphrase tailored to users’ needs Chen et al. (2019). In Impossible Distillation, endowing controllability to the student model is straightforward. We first prepare the dataset $\mathcal{D}_{\textit{control}}$ by parsing the constituency tree $t$ of each paraphrase $y$ in $\mathcal{D}_{\textit{para}}$:

\[
\mathcal{D}_{\textit{control}} = \{(x, y, t) \mid (x, y) \in \mathcal{D}_{\textit{para}},\ t = \text{parse}(y)\}
\]

Then a controllable model $\mathcal{M}_{\textit{control}}$ can be trained by fine-tuning $\mathcal{M}_{\mathit{S}}$ to generate $y$ given the source $x$ and the tree $t$.

3.5 DIMPLE and Impossible-T5

To test Impossible Distillation in both general and domain-specific paraphrasing, we use two teacher LMs – GPT2-XL and BioGPT (Luo et al., 2022; for biomedical domain), both with 1.5B parameters. We use T5-large Raffel et al. (2020) with 770M parameters as our student LM, and train it with 400k filtered pairs distilled from the teacher models. After self-distilling the student model, we obtain a specialized paraphrase generation model we call Impossible-T5, along with a large-scale corpus with 4M high-quality pairs (2M for general domain and 2M for biomedical domain). We name this dataset DIMPLE (Dataset of Impossible Paraphrases). Additional implementation details such as generation parameters and filter thresholds are provided in Appendix A.

3.6 Sentence Summarization

In addition, we generalize Impossible Distillation to abstractive sentence summarization, a task akin to paraphrasing but with different goals Zhou and Rush (2019). Whereas paraphrase generation searches for an alternative form of the source sentence while preserving all its information, summarization aims for a succinct representation of the given sentence, at the expense of losing tangential details. In our method, the distillation process is grounded on the explicit evaluation of generated pairs, hence the framework can generalize to summarization by simply redefining the filters. For example, one could add a length filter to the critics, guaranteeing that the output is strictly shorter than the input in the filtered pairs. In Appendix C, we describe the details of the generalized pipeline for sentence summarization.
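
The length filter mentioned above is simple enough to sketch directly. The version below assumes length is measured in whitespace tokens, which is our choice for illustration; the full set of summarization filters is described in Appendix C and is not reproduced here.

```python
def f_length(x: str, y: str) -> bool:
    """Accept a pair only if the output y is strictly shorter than the input x."""
    return len(y.split()) < len(x.split())


def passes_all(x, y, critics):
    """A pair is kept only if every critic in the list accepts it."""
    return all(critic(x, y) for critic in critics)


source = "The committee, after a long debate, finally approved the budget."
summary = "The committee approved the budget."
print(f_length(source, summary))
print(passes_all(source, summary, [f_length]))
```

Because the critics are plain predicates over a pair, swapping the task amounts to changing the list passed to `passes_all`, which is the sense in which the framework generalizes by redefining filters.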

4 Experiments

Dataset                               MRPC                         ParaNMT-small                ParaSCI-arXiv
Model                                 iBLEU  B-iB   BLEU   R-L     iBLEU  B-iB   BLEU   R-L     iBLEU  B-iB   BLEU   R-L
Copy-Input                            1.1    0.0    44.4   65.7    -13.7  0.0    23.3   38.2    0.0    0.0    42.8   60.2
GPT-3.5                               5.7    66.5   18.1   35.7    6.2    79.6   14.1   32.2    4.0    63.2   16.2   37.5
ChatGPT                               5.9    67.7   18.0   36.5    6.7    81.3   13.4   32.8    3.6    63.9   16.4   36.1
$\text{T5}_{\text{ParaBank1}}$        5.8    60.0   27.3   50.3    4.4    70.3   19.2   36.5    3.1    53.3   26.2   48.6
$\text{T5}_{\text{ParaBank2}}$        6.1    60.5   27.7   51.4    5.3    71.5   19.7   37.6    3.7    55.1   24.9   48.2
$\text{T5}_{\text{ChatGPT-Para}}$     5.4    62.9   21.2   40.7    5.2    74.4   15.2   33.5    3.8    59.6   23.8   41.0
Impossible-T5                         7.3    67.1   26.0   46.8    5.9    79.7   16.3   31.8    4.3    64.5   25.6   44.9
Table 2: Experimental results of Impossible-T5 and baselines on unconstrained paraphrase generation. Impossible-T5 outperforms the same size model trained on much larger datasets, and is competitive to 175B LLM in both general and domain-specific benchmarks.

4.1 Dataset Evaluation

Evaluation Setup

First, we directly compare the quality of DIMPLE against three large-scale paraphrase corpora: ParaBank1 Hu et al. (2019a), ParaBank2 Hu et al. (2019b), and ChatGPT-Para Vorobev et al. (2023). Both ParaBank1 and ParaBank2 are based on back-translation; ParaBank1 imposes lexical constraints to promote the diversity of paraphrases, and ParaBank2 additionally clusters and resamples generations to further improve the syntactic diversity. ChatGPT-Para is a dataset distilled from ChatGPT, by instructing the LLM to paraphrase sentences from Quora Sharma et al. (2019), SQUAD 2.0 Rajpurkar et al. (2018) and CNN/DM Nallapati et al. (2016).

Following Huang et al. (2023), we measure the semantic similarity, lexical diversity and syntactic diversity of each dataset. For semantic similarity, we compute the average cosine-similarity between source and paraphrase measured by SimCSE Gao et al. (2021). To estimate lexical diversity, we use 2/3-gram entropy, mean-segmented token type ratio (MSTTR; Torruella and Capsada, 2013), and the token-level Jaccard similarity between source and paraphrase. For syntactic diversity, we compute the average pairwise tree edit distance, either for the top 3 layers (TED-3) or the full tree (TED-F).
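
The lexical diversity measures can be sketched in a few lines. The details below are assumptions made for illustration and may differ from the exact setup used for Table 1: tokens are whitespace-split, entropy is in bits, and the MSTTR segment length of 50 tokens is a common default rather than a value stated in the text.

```python
import math
from collections import Counter


def ngram_entropy(sentences, n):
    """Shannon entropy (in bits) of the corpus-level n-gram distribution."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def jaccard_similarity(x, y):
    """Token-level Jaccard similarity between a source and its paraphrase."""
    a, b = set(x.split()), set(y.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def msttr(tokens, segment=50):
    """Mean type-token ratio over consecutive full segments of fixed length."""
    segments = [tokens[i:i + segment]
                for i in range(0, len(tokens) - segment + 1, segment)]
    if not segments:
        return 0.0
    return sum(len(set(seg)) / segment for seg in segments) / len(segments)


print(ngram_entropy(["he bought a car", "he purchased an automobile"], 2))
print(jaccard_similarity("he bought a car", "he purchased an automobile"))
```

Higher entropy and MSTTR indicate a more varied vocabulary across the corpus, while a lower Jaccard similarity indicates that each paraphrase shares fewer tokens with its source.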

Results

The results are shown in Table 1. In all 3 dimensions, DIMPLE consistently outperforms all baseline datasets. Notably, this includes ParaBank1, which is more than 13 times larger than DIMPLE, demonstrating the sample efficiency of our framework in extracting diverse paraphrastic knowledge. In addition, the superior results of our dataset compared to ChatGPT-Para imply that the scale of the LM is not the only factor that determines the quality of generated data. By effectively constraining the LM search space and filtering pairs with a composition of critics, Impossible Distillation makes it possible to generate a high-quality dataset from a small, low-quality LM.

Model Fluent Faithful Dissimilar
ChatGPT 2.8 2.37 2.38
$\text{T5}_{\text{ParaBank1}}$ 2.45 2.33 2.09
$\text{T5}_{\text{ParaBank2}}$ 2.47 2.50 2.21
$\text{T5}_{\text{ChatGPT-Para}}$ 2.64 2.30 2.33
Impossible-T5 2.74 2.55 2.40
Table 3: Human evaluation results (Krippendorff’s alpha = 0.62; substantial inter-annotator agreement). To minimize subjectivity, we use a strict 3-level scale, where 3 indicates that the desired property is fully satisfied and 1 indicates that it is not satisfied at all.

4.2 Unconstrained Paraphrase Generation

Evaluation Setup

In this section, we evaluate Impossible-T5 on multiple benchmarks for paraphrase generation without syntactic control. We use three human-curated benchmarks spanning general and domain-specific paraphrase generation: MRPC Dolan and Brockett (2005), ParaNMT-small Kumar et al. (2020), and ParaSCI-arXiv Dong et al. (2021). For baselines, we fine-tune T5-large with ParaBank1, ParaBank2 and ChatGPT-Para, the three large-scale paraphrase corpora analyzed in §4.1. We also consider LLM-based baselines, by zero-shot prompting GPT3.5 (text-davinci-003; Ouyang et al., 2022) and ChatGPT.

Results

In Table 2, we report BLEU and ROUGE-L (R-L) along with iBLEU Sun and Zhou (2012) and BERT-iBLEU (B-iB), a metric known to better correlate with human judgements of paraphrase quality Niu et al. (2021). Consistent with prior findings on the brittleness of token-overlap metrics Zhang et al. (2020), BLEU and ROUGE-L fail to accurately assess paraphrase quality. In fact, a simple baseline that merely copies the input as its output (Copy-Input) achieves state-of-the-art scores on these metrics, across all datasets.
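To see why Copy-Input is rewarded by overlap metrics but penalized by iBLEU, recall that iBLEU (Sun and Zhou, 2012) subtracts the candidate's overlap with the source from its overlap with the reference, weighted by a coefficient alpha. The sketch below is illustrative only: `overlap` is a simplified stand-in for BLEU (mean clipped 1- and 2-gram precision, no brevity penalty), and `alpha=0.8` is an assumed value, since the weighting used in the experiments is not stated in this section.

```python
from collections import Counter


def overlap(candidate, reference, max_n=2):
    """Simplified stand-in for BLEU: mean clipped n-gram precision."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        total = sum(c.values())
        if total == 0:
            precisions.append(0.0)
            continue
        matched = sum(min(count, r[gram]) for gram, count in c.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def ibleu(candidate, source, reference, alpha=0.8, score=overlap):
    """iBLEU: reward similarity to the reference, penalize copying the source."""
    return (alpha * score(candidate, reference)
            - (1 - alpha) * score(candidate, source))


source = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

copy_overlap = overlap(source, reference)       # Copy-Input under plain overlap
copy_ibleu = ibleu(source, source, reference)   # same output under iBLEU
```

Because the copied output overlaps perfectly with the source, the second term takes its maximum value and `copy_ibleu` falls below `copy_overlap`. A real evaluation would pass a standard BLEU implementation as `score`.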

A clearer tendency is shown with iBLEU and BERT-iBLEU: Impossible-T5 consistently outperforms the same-size model trained on the order-of-magnitude larger ParaBank, showing up to 10% relative improvement across all benchmarks. Moreover, Impossible-T5 is the only 770M model comparable to the 175B GPT-3.5 across all benchmarks. Notably, in an expert domain (ParaSCI), it even outperforms ChatGPT. Additional results against state-of-the-art unsupervised paraphrase generation methods are presented in Appendix B.

Human Evaluation

We additionally conduct human evaluation to compare the quality of the generated paraphrases. We generate 200 paraphrases with each model using the MRPC corpus, and ask six Mechanical Turk workers to evaluate whether each paraphrase is (1) fluent, (2) faithful to the source, and (3) dissimilar to the source (Appendix D). Table 3 shows the results. Consistent with the quantitative metrics, human annotators prefer paraphrases from Impossible-T5 over those from the competitive baselines. We find that our model is generally considered to be more faithful to the original statement than ChatGPT while sufficiently altering the surface form. Notably, the high faithfulness and dissimilarity do not come at the cost of fluency: Impossible-T5 obtains a better fluency score than both $\text{T5}_{\text{ParaBank}}$ and $\text{T5}_{\text{ChatGPT-Para}}$.

4.3 Syntactically Controlled Paraphrase Generation

Model iBLEU ↑ B-iB ↑ R-L ↑ TED-F ↓
$\text{ChatGPT}_{\text{0-shot}}$ 9.1 85.8 41.6 11.6
$\text{ChatGPT}_{\text{5-shot}}$ 9.0 85.9 42.2 10.3
$\text{T5}_{\text{ParaBank1}}$ 10.7 82.3 55.6 8.4
$\text{T5}_{\text{ParaBank2}}$ 10.9 84.7 57.5 8.8
$\text{T5}_{\text{ChatGPT-Para}}$ 10.5 79.4 47.6 10.4
Impossible-T5 11.2 86.6 51.8 8.5
Table 4: Results on syntactically controlled paraphrase generation. Impossible-T5 outperforms baselines in both paraphrase quality and controllability.

Evaluation Setup

Next, we assess Impossible-T5 in syntactically controlled paraphrase generation. We use ParaNMT-small, where each sample consists of a source $x$, a syntactic exemplar $z$, and a paraphrase $y$ of $x$ that follows the syntax of $z$. Since the controllable version of our model is trained with the constituency tree as input, we first parse $z$ and feed the tree into our model (along with $x$) during inference. For baselines, we consider T5 trained with existing corpora, additionally annotated with the tree of target paraphrases. We also prompt ChatGPT to generate a paraphrase with the same syntax as $z$.

Results

The results are shown in Table 4. Impossible-T5 achieves the best iBLEU and BERT-iBLEU, and its TED-F (8.5) is on par with the best baseline ($\text{T5}_{\text{ParaBank1}}$, 8.4); the exception is ROUGE-L, where the ParaBank-trained models score higher. Notably, the syntax conformity of ChatGPT is substantially worse, even with 5-shot examples of syntax-controlled paraphrases. The results imply that distilling a fine-grained controllable model could be a reasonable alternative to prompting an LLM with a textual description of the desired output.

Model Automatic Human
B-F1 R-L Fluent Faithful Concise
ChatGPT 84.8 33.6 2.55 2.44 2.32
Referee 78.2 29.2 2.45 2.33 2.41
$\text{T5}_{\text{ParaBank}}$ 77.5 29.6 2.21 2.17 1.96
Impossible-T5 85.1 30.3 2.46 2.53 2.49
Table 5: Results on sentence summarization. We report BERTScore-F1 Zhang et al. (2020) and ROUGE-L for automatic evaluation. In addition, six crowdsourced workers qualitatively assessed 100 summaries per model on a 3-level Likert scale.

4.4 Sentence Summarization

Evaluation Setup

We use Gigaword Rush et al. (2015), a representative benchmark for sentence summarization. For baselines, we use ChatGPT and Referee Sclar et al. (2022), an unsupervised summarizer distilled from GPT3. We also train T5-large on $\text{ParaBank}_{\text{summ}}$, a variant of ParaBank2 filtered using the same set of summarization critics as for Impossible Distillation.

Results

The results are shown in Table 5. We observe that re-purposing a paraphrase corpus for summarization ($\text{T5}_{\text{ParaBank}}$) leads to sub-optimal performance, as back-translation does not reflect the concise nature of summaries. In contrast, the critic models in Impossible Distillation explicitly participate in the data-generating process, steering the model toward outputs that satisfy the properties the critics check for. As a result, Impossible Distillation successfully generalizes to summarization simply by plugging in a redefined composition of filters.

Model BERT-iBLEU
Student model w/o Self-distillation 64.0
$\text{T5}_{\text{ChatGPT-Para}}$ 62.9
$\text{T5}_{\text{ChatGPT-Para}}$ + Self-distillation 63.3
$\text{T5}_{\text{ChatGPT-Para}}$ + Critic Filtering 64.1
Impossible-T5 67.1
Table 6: Ablation study on MRPC. The best configuration is Impossible Distillation incorporating both critic filtering and self-distillation.

4.5 Ablation Study

In Table 6, we conduct an ablation study to analyze the contribution of different components in Impossible Distillation.

Paraphrase Generation (General domain)

Constraint $c$

As part of the process for the upcoming release of the Android M, Google is also adding a new camera API to the latest Android OS.

Sentence $x$

This API allows third-party apps to use the camera of Android devices.

Paraphrase $y$

The new API will allow developers to use Android’s camera features to create custom apps.

Paraphrase Generation (Biomedical domain)

Constraint $c$

The impact of obesity on health-related quality of life (HRQOL) in adolescents and young adults with spinal deformity is not well described.

Sentence $x$

The purpose of this study was to compare HRQOL measures in adolescent idiopathic scoliosis (AIS) patients with and without obesity.

Paraphrase $y$

This study aimed to investigate the relationship between HRQOL and obesity in adolescents with idiopathic scoliosis (AIS).

Summarization (General domain)

Constraint $c$

There had been fears the flare could ignite the escaping gas at the Elgin platform, about 150 miles (240 km) east of the Scottish city of Aberdeen, potentially causing a huge explosion. Total said it had received the first indication that the flare might be out at lunchtime on Friday. The firm is “mobilizing all means to allow these options to be implemented,” it said. The company, which is still investigating the cause of the leak, estimates that 200,000 cubic meters of gas a day are escaping.

Sentence $x$

“The gas cloud is fairly small in size and prevailing winds are blowing it away from the platform and dispersing it,” Total said.

Summary $y$

The gas cloud is small and blowing away, Total said.

Summarization (Biomedical domain)

Constraint $c$

A banana primarily consists of carbo hydrate chains (sugar), but also contains some minor amount of minerals and vitamins. Let’s see what happens with this stuff - Sugar: Will be broken down to either be stored as fat (another form of carbo hydrate chains) or broken up and used to provide cell energy; the resulting "waste" hydrogen and carbon is disposed of in form of CO2 or H2O. Minerals: Are mainly used to regenerate organs/tissue and other organ functions; these could probably be still in your body, but even if they are, they are probably very rare. Vitamins: The atoms are very often disposed after use, so they too leave your body.

Sentence $x$

They do leave in rather short time frames, because the body can’t store them well and needs it daily (that is why your diet should include them).

Summary $y$

They do leave in a short time, as the body does not store them long.

Table 7: Qualitative examples of pair generation. Along with each $x$ and $y$, we present the contextual constraint $c$ used for pair generation.

Does self-distillation matter?

We analyze the contribution of self-distillation in two ways. First, we omit the self-distillation stage in our framework and directly test the student model $\mathcal{M}_{\mathit{S}}$ distilled from $\mathcal{M}_{\mathit{T}}$. BERT-iBLEU in this case degrades by 3.1 from Impossible-T5 (67.1 to 64.0), indicating the importance of self-distillation in amplifying model capability. Next, to verify whether self-distillation is the dominant contributor to the performance, we iterate self-distillation on $\text{T5}_{\text{ChatGPT-Para}}$. While the performance of $\text{T5}_{\text{ChatGPT-Para}}$ improves with iterative distillation, the improvement is relatively small (62.9 to 63.3), and the model still performs worse than our model without self-distillation (64.0). The result confirms that while self-distillation helps improve end-stage performance, the diversity of the data distilled from the teacher model is crucial to fully elicit the student model’s capability.

Is it all about critics?

At the core of Impossible Distillation are the critic models, filtering the noisy data pool generated from the teacher and aligning it for the target task. Therefore, it would be reasonable to ask whether the performance of Impossible-T5 solely comes from the composition of critics in our framework. To methodically verify this, we filter ChatGPT-Para using the same set of critics as in our framework, and train T5-large on the filtered dataset.

In this configuration, BERT-iBLEU on MRPC reaches 64.1, improving over the original ChatGPT-Para but still falling behind Impossible-T5. We attribute this to the relatively small size of the filtered dataset ($n \approx 340k$), primarily due to a large portion of pairs not passing either the Dissimilarity or Diversity Filter. While ChatGPT can generate sensible paraphrases, it is not aligned with the specific evaluation criteria defined in the filtering stage, leading to poor sample efficiency. Although the issue may be mitigated by fine-tuning the teacher or over-sampling more generations, such a solution would require substantially more compute than GPT2-scale LMs. In this sense, our framework provides an attractive alternative to LLM distillation, incorporating a small, cost-efficient data generator and a composition of filters in place of a gigantic data generator.

5 Related Works

Unsupervised Paraphrasing and Summarization

Conventional approaches for unsupervised paraphrasing and summarization have focused on task-specific surrogates – e.g., back-translation Wieting and Gimpel (2018); Hu et al. (2019b) and autoencoding Huang and Chang (2021); Baziotis et al. (2019) – that guide the model toward desired output. These surrogate tasks inherently provide weak supervision signal compared to the complexity of the target task, often mandating carefully engineered perturbations Huang et al. (2023); Niu et al. (2021), auxiliary constraints Liu et al. (2022b); Chen et al. (2022b) or a complete re-training of teacher model Meng et al. (2021). Apart from the task-specific methods, a growing line of research seeks to harness LLMs to paraphrase and summarize without supervision Tang et al. (2023); Goyal et al. (2023). In fact, recent findings suggest that zero-shot generations prompted from LLMs exhibit human-level quality in various use-cases Wahle et al. (2022).

Task-solving with Language Model

More broadly, task-solving capabilities of LMs have been tested and analyzed across domains Hendrycks et al. (2021). While large-scale pre-training allows models to acquire sufficient knowledge to solve complex tasks Bommarito et al. (2023); Brahman et al. (2023); Jung et al. (2022), recent works suggest that their full capability is elicited from aligning the model knowledge with additional fine-tuning – e.g. using instruction data Chung et al. (2022); Wang et al. (2022) and human feedback Ouyang et al. (2022); Ziegler et al. (2020) – which often requires a curated set of annotated data. Our work suggests an alternative to this paradigm, by identifying and leveraging the paraphrastic knowledge intrinsic to the LM, rather than human annotation.

Data Generation with Language Model

Another line of related work proposes to directly distill models with LM-generated data, improving model reasoning Zelikman et al. (2022); Hsieh et al. (2023), robustness Chen et al. (2022a), controllability Sclar et al. (2022), and language understanding Ye et al. (2022). These works essentially follow the conceptual framework of Symbolic Knowledge Distillation West et al. (2022), where a teacher model’s knowledge is transferred to a student model via a symbolic, textual dataset. Other works explore extracting a standalone corpus from LMs, whether it be a knowledge base Alivanistos et al. (2022), dialogue Kim et al. (2022), or an evaluation suite Perez et al. (2022). However, these works typically impose a strong assumption on the teacher LM Wang et al. (2021), and require a manually constructed set of prompts Bhagavatula et al. (2022). Overcoming these limitations, Impossible Distillation generalizes data generation into an off-the-shelf setup, removing the dependence on the teacher model’s capability for the target task.

6 Conclusion

In this work, we propose Impossible Distillation, a novel framework to distill a high-quality paraphrase dataset and model from small, low-quality LMs. We show that by leveraging paraphrastic proximity and critic-guided distillation, Impossible Distillation can empower small LMs to outperform competitive counterparts – in both performance and controllability, across domains and tasks, without training on human-authored paraphrases. Also, we find that DIMPLE, the natural byproduct of our method, presents higher fidelity and diversity than paraphrase datasets an order of magnitude larger. Impossible Distillation shows a promising direction to rediscover the under-explored capabilities of off-the-shelf language models, by accurately identifying their characteristics and amplifying them.

Acknowledgements

This work was funded in part by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), NSF DMS-2134012 and IARPA HIATUS via 2022-22072200003.

References

Algorithm 1 Diversity Filter
Input: A set of pairs $\mathcal{P}_{\text{in}} = \{(x_1, y_1), \cdots, (x_{|P|}, y_{|P|})\}$ generated using the same prefix $c$
Output: Filtered set of pairs $\mathcal{P}_{\text{out}}$
$E \leftarrow \emptyset$
for $i, j \in [1, |P|],\ i \neq j$ do ▷ search for duplicate pairs
     if $P_{\textit{NLI}}(x_i \Rightarrow x_j) > \tau_{\textit{entail}}$ then
          $E \leftarrow E \cup \{(x_i, y_i), (x_j, y_j)\}$
     else if $P_{\textit{NLI}}(y_i \Rightarrow y_j) > \tau_{\textit{entail}}$ then
          $E \leftarrow E \cup \{(x_i, y_i), (x_j, y_j)\}$
     end if
end for
$G \leftarrow (\mathcal{P}_{\text{in}}, E)$ ▷ define a graph where nodes are pairs and edges connect duplicate pairs
$S \leftarrow \text{Connected-Components}(G)$
$\mathcal{P}_{\text{out}} \leftarrow \emptyset$
for $\mathcal{C} \in S$ do ▷ find the max-entailing pair in each connected component
     $p_{\text{out}} = \text{argmax}_{(x, y) \in \mathcal{C}} P_{\textit{NLI}}(x \Leftrightarrow y)$
     $\mathcal{P}_{\text{out}} \leftarrow \mathcal{P}_{\text{out}} \cup \{p_{\text{out}}\}$
end for

Appendix A Implementation Details

A.1 Pair Generation

As noted in Section 3.5, we start off using GPT2-XL and BioGPT-large as teacher LMs, both with 1.5B parameters, generating paraphrases in the general and biomedical domains respectively. To sample contextual constraints from these LMs, we use Nucleus Sampling with $\textit{top\_p} = 0.9$ and $\textit{temp} = 1.0$. Additionally, we find that for the general domain, generating news-style sentences by prompting GPT2 with a simple prefix (e.g., New York (CNN) --) leads to less noisy and more diverse context. For BioGPT, we generate free-form text without any prefix. Throughout the distillation process, we used 4 Quadro RTX 8000 GPUs.
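For readers unfamiliar with nucleus (top-p) sampling, the following is a minimal sketch of a single decoding step over a vector of logits. It is not the authors' generation code; in practice the sampling options of an LM library would be used. The function name and the `rng` argument are hypothetical, and only the defaults `top_p=0.9` and `temperature=1.0` come from the text above.

```python
import math
import random


def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample one token index from the smallest set of tokens whose
    cumulative probability reaches top_p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Rank tokens by probability; keep the shortest prefix covering top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With token probabilities of 0.70, 0.25, 0.04 and 0.01, the nucleus at `top_p=0.9` contains only the first two tokens, so the low-probability tail is never sampled.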

A.2 Critic Models for Paraphrase Generation

For the semantic equivalence filter, we use RoBERTa-large-WANLI Liu et al. (2022a), readily available at HuggingFace transformers Wolf et al. (2020). To keep only the highly semantically equivalent pairs of paraphrases, we use $\tau_{\text{semantic}} = 0.75$, discarding all pairs with a bidirectional entailment score below this threshold. For the dissimilarity filter, we use $\tau_{\textit{rouge}} = 0.75$ and $\tau_{\textit{TED}} = 12$. We use the Stanford CoreNLP library to parse the constituency tree.

Finally, we present the formal algorithm of the diversity filter in Algorithm 1. We first create an undirected graph $G$ where pairs are nodes and edges exist between duplicate pairs, then find the set $S$ of all connected components in $G$. By discarding all but the one pair with the maximal entailment score $p_{\textit{NLI}}(x \Rightarrow y) + p_{\textit{NLI}}(y \Rightarrow x)$ in each component, we remove the duplicate pairs in the candidate pool. As the duplicate pair search with the NLI model is parallelizable, the time complexity follows that of the connected component search, i.e. $O(|P| + |E|)$ when using a DFS-based algorithm Tarjan (1972).
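The procedure in Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the released implementation: `entail_prob` is a hypothetical callable standing in for the NLI model, `toy_entail` is a crude token-overlap substitute used only so the example runs without a model, and the default `tau_entail=0.5` is an assumed value, since the threshold is not given in this appendix.

```python
from itertools import combinations


def diversity_filter(pairs, entail_prob, tau_entail=0.5):
    """Keep one (x, y) pair per connected component of duplicate pairs.

    pairs: list of (x, y) tuples generated from the same prefix.
    entail_prob(a, b): hypothetical scorer for P_NLI(a => b), in [0, 1].
    """
    n = len(pairs)
    adjacency = {i: set() for i in range(n)}

    def is_duplicate(a, b):
        # Algorithm 1 loops over ordered (i, j), so both directions count.
        return entail_prob(a, b) > tau_entail or entail_prob(b, a) > tau_entail

    for i, j in combinations(range(n), 2):
        (x_i, y_i), (x_j, y_j) = pairs[i], pairs[j]
        if is_duplicate(x_i, x_j) or is_duplicate(y_i, y_j):
            adjacency[i].add(j)
            adjacency[j].add(i)

    def pair_score(k):
        # Bidirectional entailment score p(x => y) + p(y => x).
        x, y = pairs[k]
        return entail_prob(x, y) + entail_prob(y, x)

    seen, kept = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Iterative DFS collects one connected component.
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        kept.append(pairs[max(component, key=pair_score)])
    return kept


def toy_entail(premise, hypothesis):
    """Toy stand-in for an NLI model: share of hypothesis tokens in premise."""
    p, h = set(premise.split()), set(hypothesis.split())
    return len(p & h) / len(h) if h else 0.0


pairs = [
    ("the cat sat on the mat", "a cat is on the mat"),
    ("the cat sat on the mat today", "the mat has a cat on it"),
    ("stocks fell sharply on monday", "shares dropped steeply monday"),
]
filtered = diversity_filter(pairs, toy_entail)
```

Here the first two pairs have near-identical sources and fall into one component, from which only the higher-scoring pair survives; the third pair forms its own component and is kept.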

Dataset Quora MSCOCO
Model iBLEU BLEU iBLEU BLEU
Lag VAE 8.73 15.52 7.69 11.63
CGMH 9.94 15.73 7.84 11.45
UPSA 12.03 18.21 9.26 14.16
BT 11.64 11.59 9.72 14.36
Corruption 12.32 17.97 10.32 15.60
ConRPG 12.68 18.31 11.17 16.98
MCPG 13.58 24.84 11.99 20.54
Impossible-T5 16.40 27.22 13.15 22.75
Table 8: Experimental results of Impossible-T5 and unsupervised paraphrase generation methods on Quora and MSCOCO. Impossible-T5 consistently outperforms all unsupervised baselines across both benchmarks, in both metrics.

Appendix B Comparison with Unsupervised Paraphrase Generation Methods

To better understand the effectiveness of Impossible Distillation, we conduct additional experiments that compare Impossible-T5 against state-of-the-art unsupervised methods for paraphrase generation (i.e. trained without human-written reference). Following prior works, we use Quora Sharma et al. (2019) and MSCOCO Lin et al. (2015) datasets repurposed for paraphrase generation. For baselines, we compare against Lag VAE He et al. (2019), CGMH Miao et al. (2019), UPSA Liu et al. (2020), BT Wieting and Gimpel (2018), Corruption Hegde and Patil (2020), ConRPG Meng et al. (2021), and MCPG Chen et al. (2022b). Following past works, we compute and report iBLEU and 4-gram BLEU of each system.

The results are as shown in Table 8. Impossible-T5 consistently outperforms all unsupervised baselines across both benchmarks, in both metrics.

Figure 4: Screenshot of MTurk interface used for the human evaluation of model generated paraphrases.

Appendix C Generalization to Sentence Summarization

C.1 Critic Models for Summarization

In Impossible Distillation, data generation can easily be adapted to sentence summarization, by redefining the filters for the target task (§3.6). Here, we explain the details of the critic models used for sentence summarization. First, a good summary should be entailed by the original statement without hallucination. Unlike paraphrases, however, summaries allow omitting less important details in the original statement. Therefore, we modify the semantic equivalence filter in paraphrase generation to consider only the unidirectional entailment between $x$ and $y$:

$$f_{\textit{semantic}}(x, y) = \mathbbm{1}\Bigl\{p_{\textit{NLI}}(x \Rightarrow y) \geq \tau_{\textit{semantic}}\Bigr\}$$

We use the same threshold value as for paraphrase generation, $\tau_{\textit{semantic}} = 0.75$. Next, a desirable summary should be a concise representation of the original statement. We therefore discard all pairs whose compression ratio (i.e. the sequence length ratio of $y$ to $x$) is larger than a threshold $\tau_{\textit{comp\_ratio}}$:

$$f_{\textit{comp\_ratio}}(x, y) = \mathbbm{1}\Bigl\{|y| < |x| \cdot \tau_{\textit{comp\_ratio}}\Bigr\}$$

For our experiments, $\tau_{\textit{comp\_ratio}} = 0.8$. Finally, we employ the diversity filter as in paraphrase generation, removing all duplicate (source, summary) pairs from the generated dataset:

$$\mathcal{D}_{\textit{T}} = \bigl\{(x, y) \mid (x, y) \in \mathcal{C}_{\textit{T}},\ f_{\textit{semantic}} \land f_{\textit{comp\_ratio}} \land f_{\textit{diversity}}(x, y) = 1\bigr\}$$
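The two pairwise summarization critics defined above can be sketched as simple predicates. This is a minimal illustration under stated assumptions: `entail_prob` is a hypothetical callable standing in for the NLI model, `toy_entail` is a token-overlap substitute used only so the example runs without a model, and sequence length is measured in whitespace tokens. The diversity filter is omitted here because it operates on a set of pairs rather than on a single pair.

```python
def f_semantic(x, y, entail_prob, tau_semantic=0.75):
    """Unidirectional entailment: the source x must entail the summary y."""
    return entail_prob(x, y) >= tau_semantic


def f_comp_ratio(x, y, tau_comp_ratio=0.8):
    """Keep pairs whose summary is shorter than tau_comp_ratio * |x|."""
    return len(y.split()) < len(x.split()) * tau_comp_ratio


def filter_summaries(candidates, entail_prob,
                     tau_semantic=0.75, tau_comp_ratio=0.8):
    """Apply both pairwise critics to a list of (x, y) candidates."""
    return [
        (x, y) for x, y in candidates
        if f_semantic(x, y, entail_prob, tau_semantic)
        and f_comp_ratio(x, y, tau_comp_ratio)
    ]


def toy_entail(premise, hypothesis):
    """Toy stand-in for an NLI model: share of hypothesis tokens in premise."""
    p, h = set(premise.split()), set(hypothesis.split())
    return len(p & h) / len(h) if h else 0.0


x = ("the gas cloud is fairly small in size and prevailing winds are "
     "blowing it away from the platform and dispersing it")
candidates = [
    (x, "the gas cloud is small and blowing away"),  # concise and entailed
    (x, x),                                          # entailed but not shorter
    (x, "the platform exploded yesterday"),          # short but not entailed
]
kept = filter_summaries(candidates, toy_entail)
```

Only the first candidate passes both critics: the verbatim copy fails the compression-ratio check and the third candidate fails the entailment check under the toy scorer.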

C.2 DIMSUM and Impossible-T5

Other than the re-defined filters, we use the same settings as paraphrase generation throughout the distillation pipeline. After self-distillation, we obtain a high-quality dataset for sentence summarization (Dataset of Impossible Summaries, or DIMSUM), with 1.5M sentence-summary pairs across news and biomedical domains. During this process, we also train T5-large into a specialized model for sentence summarization, which, as with paraphrase generation, we call Impossible-T5.

Appendix D Human Evaluation Details

For human evaluation, we recruit annotators from Amazon Mechanical Turk (MTurk) with IRB approval, and ensure that all paraphrases are annotated by 6 distinct evaluators with a HIT approval rate over 99%. To minimize subjectivity, we use a 3-point Likert scale where annotators evaluate the fluency (whether the paraphrase exhibits fluent language), faithfulness (whether the paraphrase preserves the content of the original sentence and does not hallucinate), and dissimilarity (whether the paraphrase is sufficiently different from the original statement) of each output. We compensate workers at an hourly wage of $15. Figure 4 shows the actual MTurk interface used for paraphrase evaluation.

Appendix E Limitations and Future Work

In this work, we limit our experiments to sentential paraphrasing and summarization tasks. In future work, Impossible Distillation could be applied to a broader range of tasks, e.g., translation. To generate a parallel corpus for translation without human supervision, Impossible Distillation could leverage the strong capability of multilingual LMs Lample and Conneau (2019); BigScience (2023) and cross-lingual filters Conneau et al. (2018).

Impossible Distillation makes use of a fixed set of filters (e.g., an off-the-shelf NLI model) to determine which pairs qualify as high-quality samples. Throughout the distillation pipeline, these filters remain frozen. Although our experiments show that the frozen filters are strong enough to distill a higher-quality dataset than state-of-the-art paraphrase corpora, such filters may not always be accessible in a wider range of tasks. Hence, future work could improve the framework by learning not only the task model that generates candidate pairs, but also the filter model that scores the plausibility of a given pair. We envision that by co-evolving the task model and filter model throughout the distillation stages, our framework could generalize to more complex problems such as commonsense reasoning, where it is non-trivial to define which pairs qualify as good task examples.

As with any distillation technique, Impossible Distillation carries the potential risk of amplifying undesirable properties of language models. While we focus on conditional generation tasks where the output is closely bound to the input, the trained model could inherit the bias and toxicity of its teacher in a more open-ended setting. Nonetheless, Impossible Distillation distills knowledge into a symbolic, textual dataset – which can be interpreted and evaluated, allowing users to intervene in the distillation process and selectively filter which knowledge is amplified. The inherent transparency of Impossible Distillation, when combined with recent techniques for automatic bias detection and reduction, could empower safer knowledge transfer between language models.