Improving Open-Ended Text Generation via Adaptive Decoding

Wenhong Zhu    Hongkun Hao    Zhiwei He    Yiming Ai    Rui Wang
Abstract

Current language models decode text token by token according to a probability distribution, and determining the appropriate candidates for the next token is crucial to ensure generation quality. This study introduces adaptive decoding, a mechanism that dynamically empowers language models to ascertain a sensible candidate set during generation. Specifically, we introduce an entropy-based metric called confidence and conceptualize determining the optimal candidate set as a confidence-increasing process. The rationality of including a token in the candidate set is assessed by leveraging the increment of confidence. Experimental results reveal that our method balances diversity and coherence well. The human evaluation shows that our method can generate human-preferred text. Additionally, our method can potentially improve the reasoning ability of language models.

Machine Learning, ICML

1 Introduction

When it comes to open-ended text generation tasks, the aim is to produce coherent and fluent output based on limited input information (Holtzman et al., 2020). These tasks span various applications, including casual conversation (Thoppilan et al., 2022), storytelling (Mostafazadeh et al., 2016), and similar domains. Language models (LMs) presently estimate the likelihood of the next token in text generation, and they have gained extensive application in various tasks (Brown et al., 2020; Touvron et al., 2023). One factor that determines the quality of generation is the decoding algorithm.

For LMs, training solely on maximum likelihood estimation (MLE) without additional fine-tuning, such as human preference alignment, often raises distinct degeneration issues (Holtzman et al., 2020). These challenges are closely intertwined with decoding algorithms. The decoding algorithm can be seen as a mechanism to truncate the entire vocabulary space into the next candidate token set. When this set is limited, as seen in greedy decoding and beam search, it leads to heightened repetition as the model favors previously decoded tokens (Li et al., 2020; Zhu et al., 2023). Conversely, when this set is expansive, as in top-$p$ sampling, it can result in incoherence and hallucination (Xu et al., 2023). This occurs because certain improbable tokens still contribute to the cumulative probability, causing issues when sampling these tokens. Numerous efforts have been devoted to addressing these challenges. Some endeavors focus on enhancing diversity within the specified sampling space (Su et al., 2022; Xu et al., 2023), while others strive to identify a rational sampling space (Meister et al., 2023b; Hewitt et al., 2022).

We provide one human-written text example for the GPT2-XL model and use teacher-forcing decoding to demonstrate the distinction between the human-preferred and LM-preferred tokens. In doing so, we record the probabilities of the human-preferred and LM-preferred tokens. As shown in LABEL:fig:generation_1, the choices made by humans and the LM predominantly align at high-probability points. The distinction is that some of the tokens selected by humans fall within the low-probability region predicted by the LM, whereas the model tends to choose tokens with relatively higher probability. This suggests that the selection of the next token could have many possibilities.

On the other hand, the distribution predicted by the LM exhibits a long-tailed pattern (Holtzman et al., 2020), indicating that the choice of the next token given the previous context is constrained. The main difference among long-tailed distributions is the shape of the head. For example, as shown in LABEL:fig:generation_2, the distribution is flattened, indicating that the LM has multiple choices for the next token. The diversity of generated text may be compromised if our candidate set is too limited. In LABEL:fig:generation_3, by contrast, the distribution is sharply peaked, implying restricted choices. In this scenario, an excessively large candidate set could lead to a loss of coherence in the generated text. These observations raise the question: how do we determine the rational size of the candidate set to ensure both diversity and coherence in the generated text?

In this study, we present an adaptive decoding algorithm designed to dynamically determine the size of a suitable candidate set during generation. Entropy is a metric that quantifies the disorder or randomness within a system. It is useful for gauging the confidence of an LM in predicting the probability distribution of the next token. We conceptualize identifying the optimal candidate set as a confidence-increasing procedure. Specifically, we establish a system embracing two sets: a candidate set, initially empty, and an ordered set, initially encompassing the entire vocabulary. We formulate a metric named confidence to gauge system confidence based on entropy. Continuously selecting tokens with the highest probability from the ordered set and appending them to the candidate set enhances the confidence of this system. The rationality of a token to join the candidate set is assessed based on the increment of confidence, allowing the model to dynamically determine a suitable set of candidates.

In our experiments, we performed two open-ended text generation tasks: document continuation and story generation. The results suggest that our approach significantly enhances diversity while preserving coherence in the GPT2-XL (1.5B) and Llama2-7B models. On Llama2-7B-chat, coherence is enhanced alongside increased diversity. Additionally, two mathematical reasoning tasks demonstrate that our method potentially improves the reasoning ability of language models. Human evaluation results show that our approach produces text that is more human-preferred.

In summary, the main contributions of this paper are as follows:

  • We integrate the concept of entropy reduction into the text decoding process and design a confidence increment as the metric to assess the rationality of a token as a candidate.

  • We introduce a novel adaptive decoding mechanism that enables LMs to determine an appropriately sized set of candidates autonomously during generation. (The code is available at https://github.com/zwhong714/adaptive_decoding.)

  • Adaptive decoding effectively generates text in a style resembling human expression, ensuring a balance between coherence and diversity.

2 Problem Formulation

2.1 Language Models

We follow the definitions given in Hewitt et al. (2022). Let the random variable $\boldsymbol{X}=(X_{1},X_{2},\dots,X_{T})$ represent a sequence of tokens, where each $X_{t}$ belongs to a finite vocabulary $\mathcal{V}$. We use $\boldsymbol{x}_{<t}$ to indicate a given prefix, $x_{t}$ as a specific word in $\mathcal{V}$, and $x$ as a random token in $\mathcal{V}$. An auto-regressive LM is a distribution $P_{\theta}(\boldsymbol{X})$ with parameters $\theta$, which is factorized as $P_{\theta}(\boldsymbol{x})=\prod_{t=1}^{T}P_{\theta}(x_{t}|\boldsymbol{x}_{<t})$. We denote $P_{\theta}(X_{t}|\boldsymbol{x}_{<t})$ as the conditional distribution of the LM given the context $\boldsymbol{x}_{<t}$ over $\mathcal{V}$. The standard method for training an LM is maximum likelihood estimation on the training data $\mathcal{D}$, where the log-likelihood is defined as:

$$\mathcal{L}(\boldsymbol{\theta})=\sum_{\boldsymbol{X}\in\mathcal{D}}\log P_{\boldsymbol{\theta}}(\boldsymbol{X}). \tag{1}$$

This is also equivalent to minimizing the forward cross-entropy between the true distribution $P^{*}(\boldsymbol{X})$ and $P_{\theta}(\boldsymbol{X})$:

$$-E_{\boldsymbol{X}\sim P^{*}}\sum_{t=1}^{|\boldsymbol{X}|}P^{*}(x|\boldsymbol{x}_{<t})\log P_{\theta}(x|\boldsymbol{x}_{<t}) \tag{2}$$

Recent LMs have achieved remarkably low forward cross-entropy (Radford et al., 2019; Touvron et al., 2023).
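
As a small illustration of the log-likelihood objective in Equation 1, the following sketch sums per-token log conditional probabilities over a tiny corpus. The bigram table `TOY_LM`, the token names, and the function names are hypothetical, and conditioning on only the previous token is a simplifying assumption rather than how the LMs discussed here condition on the full prefix.

```python
import math

# Hypothetical toy conditional model: P(next token | previous token).
# The table and token names are illustrative assumptions, not from the paper.
TOY_LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
}


def sequence_log_likelihood(tokens, lm=TOY_LM, bos="<s>"):
    """Sum of log P(x_t | x_<t); here the context is only the previous token."""
    total = 0.0
    prev = bos
    for tok in tokens:
        total += math.log(lm[prev][tok])
        prev = tok
    return total


def corpus_log_likelihood(corpus):
    """Equation 1: sum the sequence log-likelihoods over the dataset."""
    return sum(sequence_log_likelihood(seq) for seq in corpus)
```

Maximizing this quantity with respect to the model parameters corresponds to minimizing the cross-entropy in Equation 2 when the true distribution is replaced by the empirical one.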

Recall-prioritization

However, model training with MLE has inherent limitations for matching the model distribution to that of human language (Meister et al., 2023b). This is essentially because cross-entropy is recall-prioritized (Meister et al., 2023a). Recall here is defined as the coverage of the model distribution $P_{\theta}$ over the true distribution $P^{*}$: a high recall implies that tokens with high likelihood under $P^{*}$ also have high likelihood under $P_{\theta}$. In other words, cross-entropy focuses on increasing the model likelihood of the ground-truth next token (Ren et al., 2024). Therefore, according to Equation 1, $P_{\theta}(x_{t}|\boldsymbol{x}_{<t})$ must be nonzero for all tokens $x_{t}$ in every string $X$ in the training dataset $\mathcal{D}$, which leads to multiple possible candidates for the next token. Besides, label smoothing has become a de facto standard for large-scale training. In practice, there is more than one target next token during training due to the entropy regularization applied to the loss. As indicated in Table 1, the model prediction for the next token may not be precisely identical to the teacher-forced token. Nevertheless, it is essential to recognize that the model exhibits a significant capability in predicting the next token.

Table 1: Preliminary experiment: HIT@$k$ measures how often the human-preferred token falls within the top-$k$ tokens of the distribution predicted by an LM. Results are obtained from 1000 human-written texts from WikiText-103.

Model            HIT@1   HIT@3   HIT@5
gpt2             32.37   49.89   57.01
gpt2-xl          37.09   55.21   62.46
Llama2-7B        47.14   65.40   72.08
Llama2-chat-7B   45.04   62.81   69.57
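
A HIT@$k$ score of the kind reported in Table 1 could be computed along the following lines. This is a minimal sketch: the function name `hit_at_k` and the input layout (one probability list per position, plus the index of the human-written token at that position) are assumptions, and the paper's evaluation script may differ.

```python
def hit_at_k(distributions, target_ids, k):
    """Percentage of positions whose reference token is among the k most probable tokens.

    distributions: list of per-position probability lists over the vocabulary.
    target_ids: index of the human-written (teacher-forced) token at each position.
    """
    hits = 0
    for probs, target in zip(distributions, target_ids):
        # Indices of the k most probable tokens at this position.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        hits += target in top_k
    return 100.0 * hits / len(target_ids)
```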

Text generation

LMs can assess the likelihood of existing sequences and produce new tokens through the expression $x\sim P_{\theta}(\boldsymbol{X})$. The common application involves the generation of open-ended text. When presented with a sequence of $m$ tokens extracted from natural language, denoted as $\mathcal{C}=\{x_{1},\dots,x_{m}\}$, serving as context, the LM aims to construct an $n$-token extension. This extension is decoded using the probability distribution predicted by the LM: $P_{\theta}(x_{m+1}:x_{m+n}|\mathcal{C})=\prod_{t=1}^{n}P_{\theta}(x_{m+t}|\mathcal{C},x_{m+1},\dots,x_{m+t-1})$. The generation of the continuation occurs token by token, employing a specific decoding strategy.
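
The token-by-token process can be sketched as a loop in which the decoding strategy is a pluggable function. Everything named here (`generate`, `greedy`, `ancestral`, and the three-token `toy_lm`) is hypothetical and only illustrates the control flow; a real LM would replace `toy_lm`.

```python
import random


def generate(next_token_probs, context, n, choose):
    """Extend `context` by n tokens, one at a time.

    next_token_probs(prefix) returns a probability list over the vocabulary;
    choose(probs) is the decoding strategy and returns a token id.
    """
    tokens = list(context)
    for _ in range(n):
        probs = next_token_probs(tokens)
        tokens.append(choose(probs))
    return tokens[len(context):]


def greedy(probs):
    """Always pick the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def ancestral(probs, rng=random):
    """Sample from the full, untruncated distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def toy_lm(prefix):
    """Hypothetical 3-token model that prefers (last token + 1) mod 3."""
    probs = [0.1, 0.1, 0.1]
    probs[(prefix[-1] + 1) % 3] = 0.8
    return probs
```

Truncation-based strategies fit the same interface: they restrict `probs` to an allowed set before sampling.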

2.2 Degeneration Problems

Repetition

Repetition often arises in models trained using MLE without any intervention, leading to a gradual increase in the probability of tokens that have already occurred (Xu et al., 2022; Zhu et al., 2023). This tendency becomes particularly pronounced when utilizing greedy decoding or beam search (Holtzman et al., 2020). While this behavior aligns with the objective of MLE-trained models to continually maximize likelihood in the text generation process, it falls short of generating text preferred by humans. As illustrated in Table 2, this issue appears to be independent of the size of the model parameters. Even the GPT-3 base model (175B) exhibits significant repetition problems when using greedy decoding.

Incoherence

Many decoding algorithms have been proposed to truncate low probability tokens directly and have proven to be useful (Holtzman et al., 2020; Hewitt et al., 2022; Meister et al., 2023b). They typically compute the following truncated distribution at each time step:

$$P_{\text{trunc}}(x\mid\boldsymbol{x}_{<t})=\begin{cases}P_{\theta}(x\mid\boldsymbol{x}_{<t})/Z_{\boldsymbol{x}_{<t}} & x\in\mathcal{A}_{\boldsymbol{x}_{<t}}\\ 0 & \text{o.w.,}\end{cases} \tag{3}$$

where we denote $\mathcal{A}_{\boldsymbol{x}_{<t}}\subseteq\mathcal{V}$ as the allowed set comprising candidate following tokens for a given prefix, and $Z_{\boldsymbol{x}_{<t}}=\sum_{x\in\mathcal{A}_{\boldsymbol{x}_{<t}}}P_{\theta}(x|\boldsymbol{x}_{<t})$ is the renormalization term.
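
To make Equation 3 concrete, the sketch below renormalizes a distribution over an allowed set and, as one example of how such a set can be chosen, builds the top-$p$ set of Holtzman et al. (2020). The function names are hypothetical and the code is an illustration rather than any library's implementation.

```python
def truncate_and_renormalize(probs, allowed):
    """Equation 3: divide by Z inside the allowed set, assign 0 elsewhere."""
    z = sum(probs[i] for i in allowed)  # renormalization term Z
    return [p / z if i in allowed else 0.0 for i, p in enumerate(probs)]


def top_p_allowed_set(probs, p):
    """Smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    allowed, mass = set(), 0.0
    for i in order:
        allowed.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return allowed
```

Different truncation methods differ only in how the allowed set is constructed; the renormalization step is shared.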

Effective truncation is the key to improving the quality of text generation. Otherwise, there is a risk of incoherent and hallucinatory outputs, as the model may sample from low-probability tokens that may not be related to the prefix. As Table 2 shows, the text becomes highly incoherent as generation proceeds when we use top-$p$ sampling (Holtzman et al., 2020).

Table 2: Degeneration problem: repetition and incoherence

GPT3 base model (davinci-002), greedy
Prefix: “Barack Obama was born in Honolulu, Hawaii. He was born in”
(Repetition) Continuation: “1961. He was born on August 4, 1961. Barack Obama was born in Honolulu, Hawaii on August 4, 1961. Barack Obama was born in Honolulu”
GPT2-XL base model, top-$p=0.95$
Prefix: “Barack Obama was born in Honolulu, Hawaii. He was born in”
(Incoherence) Continuation: “Hawaii, and he is a legitimate U.S. citizen. He’s the president, and I’m the president of the United States.” — Nov. 14, 2009”

3 Preliminaries

Definition 3.1.

Given a discrete random variable $X$, which takes values in the vocabulary $\mathcal{V}$ and is distributed according to $p:P_{\theta}(X)\rightarrow[0,1]$, the entropy is defined as

$$\mathrm{H}(X):=-\sum_{x\in\mathcal{V}}p(x)\log p(x).$$

Theorem 3.2.

For any random variable $X$,

$$0\leq\mathrm{H}(X)\leq\log|\mathcal{V}|,$$

where $|\mathcal{V}|$ denotes the size of the vocabulary $\mathcal{V}$. The upper bound is tight if and only if $X$ is distributed uniformly on $\mathcal{V}$. The lower bound is tight if and only if $X$ is deterministic.

By Theorem 3.2, the upper bound is tight when the token distribution predicted by the LM is uniform, which means each token in the vocabulary $\mathcal{V}$ has the same probability of being sampled. However, the upper bound is often impossible to achieve in the real world. Moreover, the true upper bound is frequently linked to the model’s capability to predict the next token.
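
The two bounds of Theorem 3.2 can be checked numerically with a few lines. This is a minimal sketch using natural logarithms; the function name `entropy` is our own choice.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25] * 4                  # attains the upper bound log|V|
deterministic = [1.0, 0.0, 0.0, 0.0]  # attains the lower bound 0
skewed = [0.7, 0.2, 0.1, 0.0]         # lies strictly between the two bounds
```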

To mitigate the impact of the upper bound, we normalized entropy using the min-max approach. (1) This ensures that all changes fall within a similar range, enabling the comparison and analysis of distributions predicted by different models. (2) It also facilitates comparing and analyzing entropy reduction processes.

Definition 3.3.

For any random variable $X$, the confidence $\text{Conf}(X)$ is defined as the min-max scaling of entropy, and the range is $[0,1]$:

$$\text{Conf}(X):=1+\frac{\sum_{x\in\mathcal{V}}p(x)\log p(x)}{\log|\mathcal{V}|}.$$

The expression $\sum_{x\in\mathcal{V}}p(x)\log p(x)$ can be conceptually divided into two components. The first component is the known part $\sum_{i=1}^{k}p_{i}\log p_{i}$, where $k$ represents the number of known items. The second component is the unknown part, introducing maximum uncertainty, expressed as $(1-\sum_{i=1}^{k}p_{i})\log\frac{1-\sum_{i=1}^{k}p_{i}}{|\mathcal{V}|-k}$. Consequently, we can define a pivot $k$ to divide the distribution $P_{\theta}(X)$ based on the number of known items. This metric related to $k$ is as follows:

$$\text{Conf}_{k}(X)=1+\frac{1}{\log|\mathcal{V}|}\Big(\sum_{i=1}^{k}p_{i}\log p_{i}+\big(1-\sum_{i=1}^{k}p_{i}\big)\log\frac{1-\sum_{i=1}^{k}p_{i}}{|\mathcal{V}|-k}\Big), \tag{4}$$

where $p_{i}$ is the abbreviation of $p(x_{i})$.
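
Equation 4 can be evaluated directly. The sketch below assumes the probabilities are already sorted in decreasing order and uses the hypothetical name `confidence_at_k`; the guard for $k=|\mathcal{V}|$ (no residual mass left) is our own handling of the boundary case, where the unknown part vanishes.

```python
import math


def confidence_at_k(sorted_probs, k):
    """Equation 4 for probabilities sorted in decreasing order, 0 <= k <= |V|."""
    vocab_size = len(sorted_probs)
    # Known part: exact contribution of the first k tokens.
    known = sum(p * math.log(p) for p in sorted_probs[:k] if p > 0)
    # Unknown part: residual mass spread uniformly over the |V| - k other tokens.
    residual = 1.0 - sum(sorted_probs[:k])
    unknown = 0.0
    if k < vocab_size and residual > 0:
        unknown = residual * math.log(residual / (vocab_size - k))
    return 1.0 + (known + unknown) / math.log(vocab_size)
```

With $k=0$ nothing is known, the whole mass is treated as uniform, and the confidence is 0; with $k=|\mathcal{V}|$ the expression reduces to $\text{Conf}(X)$ in Definition 3.3.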

3.1 $\text{Conf}_{k}(X)$ Is Monotonically Increasing

We define the rationality of including a token in the candidate set as the extent to which the presence or absence of this token affects the model’s confidence. Specifically, it is defined as the difference in Equation 4 between neighboring pivots:

$$\begin{aligned}\Delta\text{Conf}&=\text{Conf}_{k}(X)-\text{Conf}_{k-1}(X)\\&=\frac{1}{\log|\mathcal{V}|}\Big(p_{k}\log p_{k}+\big(1-\sum_{i=1}^{k}p_{i}\big)\log\frac{1-\sum_{i=1}^{k}p_{i}}{|\mathcal{V}|-k}\\&\qquad-\big(1-\sum_{i=1}^{k-1}p_{i}\big)\log\frac{1-\sum_{i=1}^{k-1}p_{i}}{|\mathcal{V}|-k+1}\Big).\end{aligned} \tag{5}$$

In Appendix C, we demonstrate that Equation 5 is greater than zero when $P_{\theta}(X)$ is sorted in decreasing order. Under this condition, confidence steadily increases as $k$ continues to grow.
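
As a numerical illustration, the expanded form of Equation 5 can be evaluated for a small sorted distribution. The function name `delta_confidence` and the toy distribution are assumptions; the guards for an empty residual are our own handling of boundary cases. Because $\text{Conf}_{0}(X)=0$, the increments telescope, so their sum over $k=1,\dots,|\mathcal{V}|$ should equal $\text{Conf}(X)$.

```python
import math


def delta_confidence(sorted_probs, k):
    """Expanded form of Equation 5 for 1 <= k <= |V|, probabilities sorted decreasingly."""
    v = len(sorted_probs)
    p_k = sorted_probs[k - 1]
    res_prev = 1.0 - sum(sorted_probs[:k - 1])  # 1 - sum_{i<k} p_i
    res_curr = res_prev - p_k                   # 1 - sum_{i<=k} p_i
    total = p_k * math.log(p_k) if p_k > 0 else 0.0
    if k < v and res_curr > 0:
        total += res_curr * math.log(res_curr / (v - k))
    if res_prev > 0:
        total -= res_prev * math.log(res_prev / (v - k + 1))
    return total / math.log(v)


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical sorted distribution
deltas = [delta_confidence(probs, k) for k in range(1, len(probs) + 1)]
```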

3.2 Descending Trend in $\Delta\text{Conf}$

Although $\Delta\text{Conf}$ may lack a clear monotonic property, assessing its variability is feasible by examining its upper and lower bounds. Utilizing Corollary C.2, we can establish the lower bound as $\Delta\text{Conf}\geq\frac{p_{k}}{\log|\mathcal{V}|}\log(|\mathcal{V}|-k+1)$, while employing both Corollary C.1 and Corollary C.2 allows us to derive the upper bound, namely $\Delta\text{Conf}\leq\frac{(1-\sum_{i=1}^{k-1}p_{i})}{\log|\mathcal{V}|}\log(|\mathcal{V}|-k+1)$.

As the number of known items $k$ increases, it is observed that the probability $p_{k}$ decreases. Additionally, both $\log(|\mathcal{V}|-k+1)$ and $(1-\sum_{i=1}^{k-1}p_{i})$ exhibit a decreasing trend. Ultimately, the upper and lower bounds converge to zero. This implies that $\Delta\text{Conf}$ also demonstrates a decreasing trend, a pattern influenced by the interplay between the probability $p_{k}$ and the number of known items $k$.

3.3 Approximation by Truncation


Figure 3: Preliminary experiment: Averaged top-20 token distribution of different LMs on 1000 human-written texts from WikiText-103.

As illustrated in Figure 3, the token distribution predicted by the LM exhibits a pronounced long-tailed pattern. Notably, this phenomenon becomes more conspicuous as the number of LM parameters increases and after human alignment is applied.

Corollary 3.4.

When a distribution exhibits a long-tail shape, the entropy of the truncated distribution $\mathrm{H}(X(k))$ is approximately equal to the original entropy, up to a small error $\epsilon$:

$$\mathrm{H}(X)=\mathrm{H}(X(k))+\epsilon, \tag{6}$$

where $X(k)$ follows the truncated distribution as defined by Equation 3.

Equation 6 shows we can determine a reasonable candidate set size $k$ by setting a reasonable threshold $\epsilon$.
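
The approximation in Equation 6 can be observed on a synthetic long-tailed distribution. In the sketch below, the geometric decay over 50 tokens is a hypothetical stand-in for an LM's predicted distribution, and the function names are our own; the gap $\mathrm{H}(X)-\mathrm{H}(X(k))$ plays the role of $\epsilon$.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def truncated_entropy(sorted_probs, k):
    """Entropy of the top-k distribution after renormalization (Equation 3)."""
    z = sum(sorted_probs[:k])
    return entropy([p / z for p in sorted_probs[:k]])


# Hypothetical long-tailed distribution: geometric decay over 50 tokens.
weights = [0.5 ** i for i in range(50)]
total = sum(weights)
probs = [w / total for w in weights]

# Gap between the full and the truncated entropy for growing k.
gaps = [entropy(probs) - truncated_entropy(probs, k) for k in (1, 5, 10, 20)]
```

For this toy distribution the gap shrinks quickly as $k$ grows, so a small candidate set already captures almost all of the entropy.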

4 Adaptive Decoding

Our method adheres to the principle that the optimal threshold must balance diversity (i.e., including as many tokens as possible in the candidate set) and coherence (i.e., avoiding the inclusion of tokens outside the true support) (Finlayson et al., 2024).


Figure 4: How adaptive decoding works. Expanding the candidate set $\mathcal{A}_{\boldsymbol{x}_{<t}}$ depends on the rationality of the token. This rationality is determined by the extent to which the model’s confidence increases when considering the inclusion of this token.

Initialization

For clarity, we define the known portion, as discussed in Section 3, as the next token candidate set $\mathcal{A}_{\boldsymbol{x}_{<t}}$, and the unknown part as $\mathcal{B}_{\boldsymbol{x}_{<t}}$. The initialization involves setting $\mathcal{A}_{\boldsymbol{x}_{<t}}$ to an empty set and $\mathcal{B}_{\boldsymbol{x}_{<t}}$ to an ordered collection of the entire vocabulary. The order is based on the descending probabilities predicted by the LM.

Truncation

As illustrated in Figure 4, identifying a suitable set of candidates can be viewed as an iterative procedure. This involves transferring the token with the highest probability from $\mathcal{B}_{\boldsymbol{x}_{<t}}$ to $\mathcal{A}_{\boldsymbol{x}_{<t}}$. Subsequently, we calculate the incremental change in the Conf metric, as defined in Equation 5. This increment serves as a measure of the rationality of a token within the candidate set. The iteration continues until the rationality falls below a pre-set threshold, denoted as $\epsilon$. At this point, we finalize $\mathcal{A}_{\boldsymbol{x}_{<t}}$ as our truncation space.
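
The iterative procedure can be sketched in plain Python as follows. This is not the released implementation: the function name `adaptive_candidate_set` is hypothetical, and two details are assumptions on our part, namely that the token whose increment first drops below $\epsilon$ is excluded and that at least one token is always kept. The vocabulary is assumed to contain at least two tokens.

```python
import math


def adaptive_candidate_set(probs, epsilon):
    """Indices of the candidate set A chosen by thresholding the confidence increment."""
    v = len(probs)
    # Ordered set B: the whole vocabulary, sorted by decreasing probability.
    order = sorted(range(v), key=lambda i: probs[i], reverse=True)
    candidates = []  # candidate set A, initially empty
    residual = 1.0   # probability mass still in B
    for k, idx in enumerate(order, start=1):
        p_k = probs[idx]
        new_residual = max(residual - p_k, 0.0)
        # Confidence increment from Equation 5.
        delta = p_k * math.log(p_k) if p_k > 0 else 0.0
        if k < v and new_residual > 0:
            delta += new_residual * math.log(new_residual / (v - k))
        if residual > 0:
            delta -= residual * math.log(residual / (v - k + 1))
        delta /= math.log(v)
        # Assumption: stop before admitting a token whose increment is below
        # epsilon, but always keep at least one candidate.
        if candidates and delta < epsilon:
            break
        candidates.append(idx)
        residual = new_residual
    return candidates


# Hypothetical distributions over a 10-token vocabulary.
sharp = [0.91] + [0.01] * 9
flat_head = [0.24, 0.23, 0.22, 0.21] + [0.1 / 6] * 6
```

With a threshold such as `epsilon = 0.01`, the sharply peaked distribution keeps a single candidate, while the flat-headed one keeps its four head tokens, which mirrors the behaviour motivated in the introduction. The next token would then be sampled from the candidates after renormalization as in Equation 3.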

Computational complexity

Efficient implementation of adaptive decoding is achievable through Algorithm 1. Firstly, sorting words by their predicted probabilities can be accomplished in $\mathcal{O}(|\mathcal{V}|\log|\mathcal{V}|)$. Secondly, the calculation of the increment $\Delta\text{Conf}$, as described in Equation 5, is an $\mathcal{O}(|\mathcal{V}|)$ operation. By getting $\Delta\text{Conf}$, we can effectively identify the last index with a value below the pre-set threshold in $\mathcal{O}(\log|\mathcal{V}|)$. In summary, our algorithm exhibits a time complexity of $\mathcal{O}(|\mathcal{V}|\log|\mathcal{V}|)$. However, we can fully leverage the parallel computing power of GPUs to accelerate the computation process.

Algorithm 1 Adaptive Decoding Algorithm
Input: Prefix $\mathcal{C}=\{x_{1},\dots,x_{m}\}$; Language Model $P_{\theta}$; Vocabulary size $|\mathcal{V}|$; Threshold $\epsilon$
Output: Continuation $\mathcal{G}=\{x_{m+1},x_{m+2},\dots,x_{m+n}\}$
  $\mathcal{G}\leftarrow\{\}$
  for $t=m+1$ to $m+n$ do
     $p\leftarrow P_{\theta}(x_{t}|\mathcal{C})$
     $p\leftarrow\text{sort}(p,\text{decreasing order})$
     $\text{cumsum}\leftarrow\text{cumulative sum of }p$
     $\text{residual\_index}\leftarrow|\mathcal{V}|-\text{range}(1,|\mathcal{V}|+1)$
     $term1\leftarrow p\cdot\log\left(\frac{p\cdot\text{residual\_index}}{1-\text{cumsum}}\right)$
     $term2\leftarrow\log\left(\frac{1-\text{cumsum}}{\text{residual\_index}}\right)-\log\left(\frac{1-\text{cumsum}+p}{\text{residual\_index}+1}\right)$
     $\Delta\text{Conf}\leftarrow\frac{term1+(1-\text{cumsum}+p)\cdot term2}{\log|\mathcal{V}|}$
     $k\leftarrow\max(1,\text{LastIndex}(\Delta\text{Conf}>\epsilon))$
     $\triangleright$ top-$k$ sampling
     $x_{t}\leftarrow\text{sample from }\mathcal{V}^{(k)}$
     $\mathcal{G}\leftarrow\mathcal{G}\cup\{x_{t}\}$
  end for
  Return: $\mathcal{G}$
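
The per-step truncation in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `delta_conf`, `adaptive_k`, and `sample_adaptive` are hypothetical, the last vocabulary position is skipped because its residual set is empty, and the small clipping constants are an added assumption that keeps the logarithms finite.

```python
import numpy as np


def delta_conf(probs):
    """Delta-Conf for growing the candidate set to size k, for k = 1..|V|-1."""
    p = np.sort(np.asarray(probs, dtype=np.float64))[::-1]  # decreasing order
    vocab = p.size
    cumsum = np.cumsum(p)
    # Skip the last position: its residual set is empty (0/0 in the formula).
    p, cumsum = p[:-1], cumsum[:-1]
    residual = vocab - np.arange(1, vocab)      # number of tokens left outside the set
    tail = np.clip(1.0 - cumsum, 1e-12, None)   # residual mass (clipping is an assumption)
    p_safe = np.clip(p, 1e-12, None)            # avoids log(0) for zero-probability tokens
    term1 = p * np.log(p_safe * residual / tail)
    term2 = np.log(tail / residual) - np.log((tail + p) / (residual + 1))
    return (term1 + (tail + p) * term2) / np.log(vocab)


def adaptive_k(probs, eps):
    """k = max(1, last 1-based index whose Delta-Conf exceeds eps)."""
    above = np.nonzero(delta_conf(probs) > eps)[0]
    return int(above[-1]) + 1 if above.size else 1


def sample_adaptive(probs, eps, rng):
    """Sample one token id from the renormalized top-k set chosen by adaptive_k."""
    probs = np.asarray(probs, dtype=np.float64)
    k = adaptive_k(probs, eps)
    top = np.argsort(-probs)[:k]                # ids of the k most likely tokens
    weights = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=weights))
```

Under this reading of Algorithm 1, the increments telescope: their sum equals one minus the entropy of the full distribution divided by $\log|\mathcal{V}|$, which offers a quick sanity check for any reimplementation.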
Table 3: Main experiments are conducted by repeating the sampling process five times for each prefix, and the average score is reported. Metrics where the algorithm outperforms others are highlighted in bold. * denotes a significant difference compared to other methods ($p<0.05$). MAUVE is not reported for human text (-).
                             WritingPrompts                                      WikiText-103
LM         Decoding          rep-2  rep-3  rep-4  Diversity  MAUVE  Coherence    rep-2  rep-3  rep-4  Diversity  MAUVE  Coherence
GPT2-XL    Human              3.63   0.87   0.39      95.27      -      27.12     5.42   1.20   0.41      93.18      -      62.94
GPT2-XL    greedy            82.00  80.62  79.66       1.82   0.61      56.27    71.86  68.33  65.95       6.98   2.03      57.81
GPT2-XL    top-$k$           12.78   6.68   4.70      80.50  22.28      47.69     9.96   3.57   1.96      86.08  77.15      61.19
GPT2-XL    mirostat          58.48  51.81  47.32      18.17   5.61      50.10    45.38  36.21  30.74      31.11  36.06      60.51
GPT2-XL    top-$p$           12.72   6.86   4.96      80.24  21.20      48.01     8.11   2.74   1.56      88.82  78.22      60.00
GPT2-XL    typical           13.28   7.62   5.83      79.51  20.36      46.98     6.73   2.13   1.19      90.71  78.49      59.66
GPT2-XL    $\eta$-sampling    8.92   4.43   3.33      86.26  22.12      45.15     7.80   2.55   1.45      89.44  77.94      59.90
GPT2-XL    adaptive (ours)    6.86   3.66   3.07      89.30  25.18      37.71     5.20   1.43   0.80      93.23  79.57      55.98
Llama2-7B  Human              7.44   1.59   0.64      90.67      -      32.84     4.73   1.00   0.33      94.10      -      60.34
Llama2-7B  greedy            77.32  74.67  72.84       5.58   1.51      46.40    43.71  36.70  32.09      34.54  22.28      55.85
Llama2-7B  top-$k$            9.13   4.45   3.27      87.18  23.57      23.02     7.18   2.44   1.20      90.01  74.80      58.49
Llama2-7B  top-$p$            8.33   4.10   3.29      88.38  23.60      23.05     7.07   2.49   1.33      90.05  77.19      58.19
Llama2-7B  mirostat          13.93   8.49   6.65      79.71  21.39      56.23    12.45   6.14   3.92      80.63  69.22      65.79
Llama2-7B  typical            7.16   3.55   2.91      90.33  25.78      22.56     7.06   2.48   1.27      90.08  76.04      57.96
Llama2-7B  $\eta$-sampling    6.60   2.96   2.35      90.99  24.43      23.19     5.93   1.84   0.90      91.96  78.26      57.46
Llama2-7B  adaptive (ours)    7.02   2.63   1.58      89.98  35.26      35.10     6.81   2.31   1.24      90.55  76.74      58.22

5 Experiments

5.1 Setups

Models

We assess the performance of our method on the GPT2-XL model (Radford et al., 2019), the Llama2-7B model, and its associated chat version (Touvron et al., 2023).

Datasets

We explore two open-ended text generation applications. The first is document continuation on the WikiText-103 dataset (Merity et al., 2017), which contains a large collection of Wikipedia articles. The second is story generation on the WritingPrompts dataset (Fan et al., 2018), a notably challenging task that involves crafting imaginative continuations based on abstract, high-level story prompts contributed by online users.

Baselines

Our proposed algorithm is compared against various decoding algorithms. The top-$k$ method (Fan et al., 2018) maximizes $\sum_{x\in\mathcal{A}_{\boldsymbol{x}_{<t}}}P_{\theta}\left(x\mid\boldsymbol{x}_{<t}\right)$ while constraining $|\mathcal{A}_{\boldsymbol{x}_{<t}}|=k$, where $k$ is a predefined hyperparameter. The top-$p$ method (Holtzman et al., 2020) minimizes $|\mathcal{A}_{\boldsymbol{x}_{<t}}|$ while constraining $\sum_{x\in\mathcal{A}_{\boldsymbol{x}_{<t}}}P_{\theta}\left(x\mid\boldsymbol{x}_{<t}\right)\geq p$, where $p$ is a preset hyperparameter. The Mirostat method (Basu et al., 2021) controls perplexity by estimating $|\mathcal{A}_{\boldsymbol{x}_{<t}}|$ based on Zipf's law and the target cross-entropy. Typical decoding (Meister et al., 2023b) minimizes the absolute distance $\sum_{x\in\mathcal{A}_{\boldsymbol{x}_{<t}}}\left|\mathrm{H}\left(X_{t}\mid\boldsymbol{X}_{<t}=\boldsymbol{x}_{<t}\right)+\log P_{\theta}\left(x\mid\boldsymbol{x}_{<t}\right)\right|$ while constraining $\sum_{x\in\mathcal{A}_{\boldsymbol{x}_{<t}}}P_{\theta}\left(x\mid\boldsymbol{x}_{<t}\right)\geq\tau$, where $\tau$ is a preset hyperparameter. $\eta$-sampling (Hewitt et al., 2022) considers the allowed set $\mathcal{A}_{\boldsymbol{x}_{<t}}=\{x\in\mathcal{V}\mid P_{\theta}(x\mid\boldsymbol{x}_{<t})>\eta\}$, where $\eta=\min\left(\epsilon,\alpha\exp\left(-\mathrm{H}\left(X_{t}\mid\boldsymbol{X}_{<t}=\boldsymbol{x}_{<t}\right)\right)\right)$, and $\alpha$ and $\epsilon$ are hyperparameters.
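
To make the baseline truncation rules concrete, the sketch below computes the size of the allowed set for top-$k$, top-$p$, and $\eta$-sampling on a single next-token distribution. The function names are hypothetical, the $\eta$ threshold follows the form $\min(\epsilon,\alpha\exp(-\mathrm{H}))$ of Hewitt et al. (2022), and Mirostat and typical decoding are omitted for brevity.

```python
import numpy as np


def top_k_size(probs, k):
    """Top-k keeps the k most likely tokens (or all tokens if fewer exist)."""
    return min(k, len(probs))


def top_p_size(probs, p):
    """Top-p keeps the smallest prefix of sorted tokens whose mass reaches p."""
    cumsum = np.cumsum(np.sort(np.asarray(probs, dtype=np.float64))[::-1])
    size = int(np.searchsorted(cumsum, p, side="left")) + 1
    return min(size, len(cumsum))


def eta_size(probs, epsilon, alpha):
    """Eta-sampling keeps tokens with probability above min(eps, alpha * exp(-H))."""
    probs = np.asarray(probs, dtype=np.float64)
    nonzero = probs[probs > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    eta = min(epsilon, alpha * np.exp(-entropy))
    # Guard so that at least the most likely token is always kept.
    return max(1, int(np.sum(probs > eta)))
```

For example, on the toy distribution `[0.5, 0.3, 0.1, 0.05, 0.05]`, `top_p_size(probs, 0.85)` keeps three tokens because the first two cover only 0.8 of the mass.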

Automatic evaluation

Repetition is defined as $\textbf{rep-n}=1.0-\frac{\#\text{unique n-gram}}{\#\text{total n-gram}}$, reported as a percentage, and measures sequence-level repetition according to the portion of duplicate n-grams (Welleck et al., 2020). Diversity considers the repetition of generated text at different $n$-gram levels and is calculated as $\prod_{n=2}^{4}\left(1.0-\frac{\textbf{rep-n}}{100}\right)$ (Meister et al., 2023b). MAUVE (Pillutla et al., 2021) is a metric that quantifies the similarity in token distribution between generated text and human-written text. Semantic coherence is defined as $v_{\mathcal{C}}^{\top}v_{\mathcal{G}}/\left(\left|v_{\mathcal{C}}\right|\cdot\left|v_{\mathcal{G}}\right|\right)$, where $v_{\mathcal{C}}=\operatorname{SimCSE}(\mathcal{C})$ and $v_{\mathcal{G}}=\operatorname{SimCSE}(\mathcal{G})$. That is, it is the cosine similarity between the SimCSE (Gao et al., 2021) sentence embeddings of the prefix and the continuation.
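
The repetition, diversity, and coherence formulas can be sketched in a few lines of plain Python. The function names are hypothetical, rep-n is returned as a percentage so that the division by 100 in the diversity product applies directly, and the coherence helper assumes the two sentence embeddings have already been computed (for example by a SimCSE encoder, which is not loaded here).

```python
import math


def rep_n(tokens, n):
    """rep-n as a percentage: 100 * (1 - unique n-grams / total n-grams)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n: nothing to repeat
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))


def diversity(tokens):
    """Product over n = 2, 3, 4 of (1 - rep-n / 100)."""
    score = 1.0
    for n in (2, 3, 4):
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return score


def cosine_coherence(v_prefix, v_continuation):
    """Cosine similarity between two precomputed embedding vectors."""
    dot = sum(a * b for a, b in zip(v_prefix, v_continuation))
    norm_c = math.sqrt(sum(a * a for a in v_prefix))
    norm_g = math.sqrt(sum(b * b for b in v_continuation))
    return dot / (norm_c * norm_g)
```

On the toy sequence `a b a b a b`, two of the five bigrams are unique, so rep-2 is 60%, while a sequence with no repeated n-grams has a diversity of 1.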

5.2 Implementation Details

We randomly select 1200 data samples from the training set of each dataset, use 1000 of them to evaluate different decoding algorithms, and use the remaining 200 to select hyperparameters. Hyperparameter scans can be found in Appendix D, and, to guarantee fairness, we select the hyperparameters that yield the best MAUVE score (Meister et al., 2023b). For WritingPrompts, we feed the original prompt from the dataset directly to the models for the continuation task. For WikiText-103, we use a prefix with a fixed length of 32. The maximum generation length is constrained to 256 tokens for both datasets. To ensure robustness, we conduct the sampling process five times for each prefix and report the average score. Furthermore, all decoding methods adopt multinomial sampling within the truncated space.

5.3 Results

The primary experiments are presented in Table 3. The MAUVE score is measured by referencing human-written texts. Human-written texts exhibit high diversity. Since a WritingPrompts prompt admits many story continuations, coherence on this dataset is low even for human-written text. Greedy decoding with LMs trained by MLE produces severe repetition, leading to high coherence but low MAUVE and low diversity, which indicates that the generated text differs markedly from human-written text. Conversely, this phenomenon is significantly alleviated when we expand the set of allowable candidates for the next token. Therefore, we must consider these metrics jointly when analyzing the results in Table 3.

With adaptive decoding, the two LMs generally perform better than with other methods on the two datasets in terms of MAUVE and diversity. Moreover, on the WritingPrompts dataset, adaptive decoding yields higher coherence than human-written text. This observation suggests that our approach is well suited to datasets like WritingPrompts, where it produces higher-quality text from limited prompts. In contrast to our approach, Mirostat determines the size of candidate sets by managing perplexity, resulting in a notable enhancement in coherence. Nevertheless, the diversity and MAUVE scores suggest that this gain could be attributed to partial repetition in the generated text. Additionally, we found that $\eta$-sampling is quite competitive with our proposed method, and we perform a detailed comparison in Section 6.

5.4 Human Evaluation

We randomly select 200 samples from WikiText-103 for human evaluation, as the prefixes of this dataset offer ample contextual information. The evaluation guidance is outlined in Appendix F, with the results presented in Table 4.

Table 4: Human evaluation of 200 WikiText-103 samples using the Llama2-7B model. * denotes a significant difference compared to other methods ($p<0.05$).

Method A   A Is Better (%)   Neutral (%)   B Is Better (%)   Method B
adaptive             4.000        45.000            51.000   human
adaptive            64.500        17.000            18.500   top-$k$
adaptive            64.000        15.000            21.000   top-$p$
adaptive            50.125        30.875            19.000   $\eta$-sampling

Table 4 shows that, although adaptive decoding falls short of parity with human-written content, 45% of the evaluated samples were judged neutral, meaning that annotators found it difficult to distinguish between human-written and adaptively decoded text. At the same time, text produced by adaptive decoding is preferred over text generated through top-$k$, top-$p$, and $\eta$-sampling. Furthermore, during human evaluation, we found that $\eta$-sampling occasionally generates excessively short text.

5.5 Adaptive Decoding for Policy Fine-tuned Model

Llama2-chat is fine-tuned with an objective aligned to human preferences rather than MLE, as detailed in Touvron et al. (2023). We compare adaptive decoding with top-$p$ sampling on the Llama2-7B-chat model, using its official generation configuration with the temperature set to 0.6. As illustrated in Table 5, our decoding algorithm enhances generation quality, increasing diversity without compromising coherence.

Table 5: Results on Llama2-7B-chat. Experiments are conducted by repeating the sampling process five times for each prefix from WritingPrompts, and the average score is reported. $p$ is set to $0.95$ while the threshold for adaptive decoding is set to $0.0005$. * denotes a significant difference compared to other methods ($p<0.05$).
Decoding    MAUVE   Diversity   Coherence
top-$p$     31.85       87.41       49.85
adaptive    37.06       93.99       54.39

In addition, we demonstrate the usefulness of our method on real-world test sets such as the MT-Bench benchmark (Zheng et al., 2024). We use GPT-4 (OpenAI et al., 2023) to assess the generations produced by top-$p$ and adaptive decoding, maintaining the same evaluation settings as Zheng et al. (2024). Under these settings, the agreement rate between human judges and GPT-4 evaluation exceeds 80%. The MT-Bench results for Llama2-7B-chat are summarized in Table 6:

Table 6: Evaluating text generation of Llama2-7B-chat on the MT-Bench dataset using GPT-4 (score: $0\sim 10$). $p$ is set to $0.95$ while the threshold for adaptive decoding is set to $0.0005$.
Decoding    turn-1   turn-2     mean
top-$p$     6.7688   5.7625   6.2656
adaptive    6.9000   5.9750   6.4375

5.6 Reasoning Tasks

We use the chat versions of the Llama2-7B, Llama2-13B, and Llama2-70B models (Table 7) to tackle mathematical reasoning tasks in GSM8K (Cobbe et al., 2021) and MultiArith. We randomly selected 300 samples from each dataset and prompted the models with the same instruction shown in Appendix E. The results were checked by humans.

Table 7: Zero-shot evaluation. $p$ is set to $0.95$ while the threshold for adaptive decoding is set to $0.0005$. All the models use their chat version.
GSM8K
Decoding    Llama2-7B   Llama2-13B   Llama2-70B
greedy          29.33        39.33        56.33
top-$p$         29.00        37.00        56.00
adaptive        30.00        40.33        58.67
MultiArith
Decoding    Llama2-7B   Llama2-13B   Llama2-70B
greedy          69.33        66.67        84.67
top-$p$         68.33        73.33        85.33
adaptive        69.33        71.33        87.00

Initially, we observed that greedy decoding yielded satisfactory results, as these models were fine-tuned on human instructions, so we take it as our baseline. Subsequently, we expanded the candidate size to investigate whether the models exhibited sycophancy or encountered other issues with reasoning, potentially leading to decreased performance. It was observed that increasing the candidate size improved performance for relatively smaller models, and the effect was the opposite for larger models. Nonetheless, our adaptive decoding strategy maintained performance levels, matching or exceeding greedy decoding in every setting in Table 7.

6 Analysis

6.1 Ablation Study

Effect of threshold

As outlined in Section 3.2, $\Delta$Conf strongly correlates with the current probability and the size of the allowed set. Consequently, adjusting the threshold lets us control both. The results in Table 8 show that the hyperparameter $\epsilon$ is readily controllable. As $\epsilon$ increases, the allowed set size $k$ diminishes, while the cumulative probability within the set remains relatively constant and its standard deviation increases. Concurrently, there is an observable rise in model confidence. This trend suggests that heightened confidence levels may lead to degradation issues within a constrained allowable set.

Table 8: Effect of threshold and sampling method. Results are obtained from WritingPrompts.
Random sampling
LM        $\epsilon$   $k$              cumulative prob.   Conf
GPT2-XL   0.001        30.34 ± 13.23    77.34 ± 14.67      64.89 ± 14.13
GPT2-XL   0.005         8.13 ± 4.17     73.20 ± 18.40      71.82 ± 14.12
GPT2-XL   0.01          4.31 ± 2.58     70.44 ± 21.36      75.30 ± 14.83
GPT2-XL   0.02          1.80 ± 1.27     75.28 ± 26.22      84.06 ± 16.30
Llama2    0.001        31.81 ± 15.23    79.22 ± 13.84      65.08 ± 14.91
Llama2    0.005         8.14 ± 4.67     77.65 ± 17.07      74.41 ± 14.41
Llama2    0.01          4.51 ± 2.70     73.90 ± 19.78      77.00 ± 14.05
Llama2    0.02          2.13 ± 1.47     75.51 ± 23.15      83.89 ± 14.21
Multinomial sampling
LM        $\epsilon$   $k$              cumulative prob.   Conf
GPT2-XL   0.001        30.40 ± 12.99    77.50 ± 14.59      64.89 ± 13.92
GPT2-XL   0.005         8.21 ± 4.17     73.20 ± 18.35      71.77 ± 14.03
GPT2-XL   0.01          4.25 ± 2.57     70.17 ± 21.75      75.20 ± 15.12
GPT2-XL   0.02          1.73 ± 1.23     76.93 ± 25.70      85.15 ± 15.98
Llama2    0.001        21.73 ± 15.79    89.09 ± 11.00      77.17 ± 15.12
Llama2    0.005         6.47 ± 4.56     83.26 ± 16.32      80.44 ± 14.53
Llama2    0.01          3.39 ± 2.61     81.39 ± 19.45      83.76 ± 14.61
Llama2    0.02          1.67 ± 1.27     83.00 ± 22.63      89.00 ± 14.05

Effect of sampling method

Beyond the truncation space itself, the method used to sample within this space plays a crucial role. In our investigation, we compare random sampling with multinomial sampling. The latter preserves the probability distribution of the allowed set and samples tokens according to that distribution. As illustrated in Table 8, for the GPT2-XL model, the choice of sampling method has only a marginal impact on the results. This can be attributed to the observation that, for the smaller LM, the distribution within the top-$k$ space tends to flatten, as depicted in Figure 3.

However, in the case of the Llama2-7B model, an interesting observation emerges: as $\epsilon$ increases, the multinomial method enhances model confidence and converges on a restricted allowed set. This phenomenon may arise from multinomial sampling being driven by the probability distribution, leading the model to favor tokens with the highest probabilities. Consequently, this contributes to degradation issues for models trained with MLE. Conversely, based on the outcomes of random sampling, including a few low-probability tokens appears to mitigate this degradation phenomenon. This suggests we need to lower $\epsilon$ when using multinomial sampling.
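
The difference between the two sampling methods can be illustrated with a short NumPy sketch. It assumes that "random sampling" means a uniform draw over the allowed set, which is an interpretation rather than something the text states explicitly, and the function name `sample_from_top_k` is hypothetical.

```python
import numpy as np


def sample_from_top_k(probs, k, rng, mode="multinomial"):
    """Draw one token id from the k most likely tokens.

    mode="multinomial": renormalize the top-k probabilities and sample from them.
    mode="random": ignore the probabilities and pick uniformly (assumed reading).
    """
    probs = np.asarray(probs, dtype=np.float64)
    top = np.argsort(-probs)[:k]                # ids of the k most likely tokens
    if mode == "random":
        return int(rng.choice(top))
    weights = probs[top] / probs[top].sum()     # renormalized allowed-set distribution
    return int(rng.choice(top, p=weights))
```

With a peaked distribution, the multinomial variant returns the most likely token far more often than the uniform variant does, which matches the intuition that probability-driven sampling reinforces already confident predictions.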

Figure 5: Unit tests of the truncation behavior of top-$p$, typical, $\eta$-sampling and adaptive decoding on CheckList-inspired prefixes. $p=0.90$, typical $=0.90$, $\eta=\sqrt{0.0009}=0.03$, and the threshold for adaptive decoding $\epsilon$ is $0.0005$.

6.2 Decoding Latency

We employed these decoding algorithms to generate text based on 100 distinct prefixes and computed the average per-token decoding time (measured after obtaining the logits from the model). Our implementation builds on the top-$p$ algorithm. As observed from Table 9, there is a clear trend of decreasing average token decoding time with a reduced vocabulary size.

Table 9: Decoding latency. The computational hardware is NVIDIA RTX 3090, with the model loaded in float16. The unit of measurement is milliseconds (ms).
Decoding    GPT2-XL   Llama2-7B
greedy       0.0221      0.0265
top-$k$      0.3116      0.3466
top-$p$      0.4085      0.3706
adaptive     0.5741      0.5265

6.3 Qualitative Analysis

We compare the truncation behaviors of different decoding algorithms through examples extracted from Ribeiro et al. (2020).

We observed that top-$p$ and typical decoding exhibit similar truncation behaviors. They effectively truncate low-entropy distributions but are less suitable for high-entropy distributions. Adaptive decoding behaves similarly to $\eta$-sampling for low-entropy distributions, offering enhanced diversity compared to the previous two methods. However, our method uniquely establishes a bounded candidate set for high-entropy distributions, setting it apart from the other three approaches.

7 Related Work

Deterministic algorithms

Establishing a reasonable size for the next candidate token set is challenging. Greedy decoding selects the token with the highest probability as the next token. Beam search maintains multiple probable tokens as candidates for the next tokens. However, both often lead to high repetition. Contrastive search (Su et al., 2022) predefines the top-$k$ space and selects the token that is most discriminative with respect to the previous context. However, it is time-consuming and may select an irrational token. Penalty decoding (Zhu et al., 2023) aims to improve greedy decoding by penalizing tokens that have appeared before, since the model tends to increase the probability of these tokens. Look-back decoding (Xu et al., 2023) adds randomness to greedy decoding: it uses KL divergence to track the distribution distance between the current and historical decoding steps and limits the next-token probability distribution to a reasonable distance from the history.

Stochastic algorithms

The most typical stochastic decoding algorithms are top-$k$ (Fan et al., 2018) and top-$p$ (Holtzman et al., 2020). Several later works aim to improve the truncation space. Mirostat (Basu et al., 2021) also proposes an adaptive decoding algorithm. In contrast to our work, it aims to keep the perplexity close to a target value: it first calculates one value assuming words follow Zipf's law and then designs a function of this value to calculate the size of the allowed set. $\eta$-sampling (Hewitt et al., 2022) truncates words below an entropy-dependent probability threshold. Typical decoding (Meister et al., 2023b) ranks words by the distance between their negative log probability and the conditional entropy, and then takes words in order of increasing distance until they cover a preset fraction of the distribution. Recently, Finlayson et al. (2024) proposed basis-aware-threshold (BAT) sampling, which uses the softmax matrix to find tokens that might have non-zero true probability without relying on a threshold.

8 Conclusion

In conclusion, we introduce adaptive decoding, a novel decoding algorithm for open-ended text generation that leverages the idea of entropy reduction to dynamically determine the size of a suitable candidate set. Experiments demonstrate its effectiveness in balancing coherence and diversity, outperforming existing decoding methods across different language model scales.

Impact Statement

This paper presents work that aims to advance the field of natural language generation. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This paper is supported by the National Natural Science Foundation of China (62176153) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102, as the MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University).

References

  • Basu et al. (2021) Basu, S., Ramachandran, G. S., Keskar, N. S., and Varshney, L. R. Mirostat: a neural text decoding algorithm that directly controls perplexity. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020.
  • Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021.
  • Fan et al. (2018) Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082.
  • Finlayson et al. (2024) Finlayson, M., Hewitt, J., Koller, A., Swayamdipta, S., and Sabharwal, A. Closing the curious case of neural text degeneration. In The Twelfth International Conference on Learning Representations, 2024.
  • Gao et al. (2021) Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552.
  • Hewitt et al. (2022) Hewitt, J., Manning, C., and Liang, P. Truncation sampling as language model desmoothing. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  3414–3427, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  • Holtzman et al. (2020) Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • Li et al. (2020) Li, M., Roller, S., Kulikov, I., Welleck, S., Boureau, Y.-L., Cho, K., and Weston, J. Don’t say that! making inconsistent dialogue unlikely with unlikelihood training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4715–4728, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.428.
  • Meister et al. (2023a) Meister, C., Pimentel, T., Malagutti, L., Wilcox, E., and Cotterell, R. On the efficacy of sampling adapters. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1437–1455, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.80.
  • Meister et al. (2023b) Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102–121, 2023b.
  • Merity et al. (2017) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.
  • Mostafazadeh et al. (2016) Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  839–849, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1098.
  • OpenAI et al. (2023) OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman, G., Brooks, T., Brundage, M., Button, K., Cai, T., Campbell, R., Cann, A., Carey, B., Carlson, C., Carmichael, R., Chan, B., Chang, C., Chantzis, F., Chen, D., Chen, S., Chen, R., Chen, J., Chen, M., Chess, B., Cho, C., Chu, C., Chung, H. W., Cummings, D., Currier, J., Dai, Y., Decareaux, C., Degry, T., Deutsch, N., Deville, D., Dhar, A., Dohan, D., Dowling, S., Dunning, S., Ecoffet, A., Eleti, A., Eloundou, T., Farhi, D., Fedus, L., Felix, N., Fishman, S. P., Forte, J., Fulford, I., Gao, L., Georges, E., Gibson, C., Goel, V., Gogineni, T., Goh, G., Gontijo-Lopes, R., Gordon, J., Grafstein, M., Gray, S., Greene, R., Gross, J., Gu, S. S., Guo, Y., Hallacy, C., Han, J., Harris, J., He, Y., Heaton, M., Heidecke, J., Hesse, C., Hickey, A., Hickey, W., Hoeschele, P., Houghton, B., Hsu, K., Hu, S., Hu, X., Huizinga, J., Jain, S., Jain, S., Jang, J., Jiang, A., Jiang, R., Jin, H., Jin, D., Jomoto, S., Jonn, B., Jun, H., Kaftan, T., Kaiser, Ł., Kamali, A., Kanitscheider, I., Keskar, N. S., Khan, T., Kilpatrick, L., Kim, J. W., Kim, C., Kim, Y., Kirchner, H., Kiros, J., Knight, M., Kokotajlo, D., Kondraciuk, Ł., Kondrich, A., Konstantinidis, A., Kosic, K., Krueger, G., Kuo, V., Lampe, M., Lan, I., Lee, T., Leike, J., Leung, J., Levy, D., Li, C. M., Lim, R., Lin, M., Lin, S., Litwin, M., Lopez, T., Lowe, R., Lue, P., Makanju, A., Malfacini, K., Manning, S., Markov, T., Markovski, Y., Martin, B., Mayer, K., Mayne, A., McGrew, B., McKinney, S. M., McLeavey, C., McMillan, P., McNeil, J., Medina, D., Mehta, A., Menick, J., Metz, L., Mishchenko, A., Mishkin, P., Monaco, V., Morikawa, E., Mossing, D., Mu, T., Murati, M., Murk, O., Mély, D., Nair, A., Nakano, R., Nayak, R., Neelakantan, A., Ngo, R., Noh, H., Ouyang, L., O’Keefe, C., Pachocki, J., Paino, A., Palermo, J., Pantuliano, A., Parascandolo, G., Parish, J., Parparita, E., Passos, A., Pavlov, M., Peng, A., Perelman, A., de Avila Belbute Peres, F., Petrov, M., de Oliveira Pinto, H. P., Pokorny, M., Pokrass, M., Pong, V., Powell, T., Power, A., Power, B., Proehl, E., Puri, R., Radford, A., Rae, J., Ramesh, A., Raymond, C., Real, F., Rimbach, K., Ross, C., Rotsted, B., Roussez, H., Ryder, N., Saltarelli, M., Sanders, T., Santurkar, S., Sastry, G., Schmidt, H., Schnurr, D., Schulman, J., Selsam, D., Sheppard, K., Sherbakov, T., Shieh, J., Shoker, S., Shyam, P., Sidor, S., Sigler, E., Simens, M., Sitkin, J., Slama, K., Sohl, I., Sokolowsky, B., Song, Y., Staudacher, N., Such, F. P., Summers, N., Sutskever, I., Tang, J., Tezak, N., Thompson, M., Tillet, P., Tootoonchian, A., Tseng, E., Tuggle, P., Turley, N., Tworek, J., Uribe, J. F. C., Vallone, A., Vijayvergiya, A., Voss, C., Wainwright, C., Wang, J. J., Wang, A., Wang, B., Ward, J., Wei, J., Weinmann, C., Welihinda, A., Welinder, P., Weng, J., Weng, L., Wiethoff, M., Willner, D., Winter, C., Wolrich, S., Wong, H., Workman, L., Wu, S., Wu, J., Wu, M., Xiao, K., Xu, T., Yoo, S., Yu, K., Yuan, Q., Zaremba, W., Zellers, R., Zhang, C., Zhang, M., Zhao, S., Zheng, T., Zhuang, J., Zhuk, W., and Zoph, B. GPT-4 technical report, 2023.
  • Pillutla et al. (2021) Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., and Harchaoui, Z. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  4816–4828. Curran Associates, Inc., 2021.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Ren et al. (2024) Ren, S., Wu, Z., and Zhu, K. Q. EMO: Earth mover distance optimization for auto-regressive language modeling. In The Twelfth International Conference on Learning Representations, 2024.
  • Ribeiro et al. (2020) Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4902–4912, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.442.
  • Su et al. (2022) Su, Y., Lan, T., Wang, Y., Yogatama, D., Kong, L., and Collier, N. A contrastive framework for neural text generation. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  21548–21561. Curran Associates, Inc., 2022.
  • Thoppilan et al. (2022) Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., Chen, D., Xu, Y., Chen, Z., Roberts, A., Bosma, M., Zhao, V., Zhou, Y., Chang, C.-C., Krivokon, I., Rusch, W., Pickett, M., Srinivasan, P., Man, L., Meier-Hellstern, K., Morris, M. R., Doshi, T., Santos, R. D., Duke, T., Soraker, J., Zevenbergen, B., Prabhakaran, V., Diaz, M., Hutchinson, B., Olson, K., Molina, A., Hoffman-John, E., Lee, J., Aroyo, L., Rajakumar, R., Butryna, A., Lamm, M., Kuzmina, V., Fenton, J., Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • Welleck et al. (2020) Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., and Weston, J. Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2020.
  • Xu et al. (2022) Xu, J., Liu, X., Yan, J., Cai, D., Li, H., and Li, J. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  3082–3095. Curran Associates, Inc., 2022.
  • Xu et al. (2023) Xu, N., Zhou, C., Celikyilmaz, A., and Ma, X. Look-back decoding for open-ended text generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  1039–1050, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.66.
  • Zheng et al. (2024) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhu et al. (2023) Zhu, W., Hao, H., and Wang, R. Penalty decoding: Well suppress the self-reinforcement effect in open-ended text generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  1218–1228, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.78.

Appendix A Limitations

We acknowledge that the adaptive decoding algorithm introduces additional computational overhead, and we hope that future work can reduce or eliminate it. While we have theoretically demonstrated the rationale behind our method, further experimental validation on more extensive reasoning tasks is needed to confirm its effectiveness. We also anticipate the development of improved metrics for evaluating model generations. Finally, we have verified the decoding algorithm only on language models and have not yet extended it to other areas such as the multimodal domain.

Appendix B Entropy Change

The entropy plot shows that the model's entropy fluctuates notably during text generation. Furthermore, because different models use different tokenizers and therefore different vocabularies, the upper bound on generation entropy differs from model to model. The amplitude of entropy changes is also influenced by the quality of model pre-training. Normalizing entropy is therefore important so that values are comparable across models.
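The following is a minimal Python sketch of such a normalization. It assumes that entropy is divided by $\log|\mathcal{V}|$, the entropy of a uniform distribution over the vocabulary; the function name `normalized_entropy` and the toy distributions are illustrative and not taken from the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of `probs` divided by its upper bound log|V|.

    Assumption: |V| is simply len(probs), i.e. the full vocabulary
    distribution is passed in.
    """
    vocab_size = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(vocab_size)

# Two uniform distributions over vocabularies of different sizes.
small_vocab = [1 / 4] * 4
large_vocab = [1 / 1000] * 1000

# Raw entropies differ (log 4 vs. log 1000), normalized values agree.
raw_small = -sum(p * math.log(p) for p in small_vocab)
raw_large = -sum(p * math.log(p) for p in large_vocab)
print(raw_small, raw_large)
print(normalized_entropy(small_vocab), normalized_entropy(large_vocab))
```

After normalization both distributions sit at the same point of a $[0, 1]$ scale, which is what makes curves from models with different vocabularies comparable.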

Appendix C Proof

C.1 Proof of Section 3.1

Examine Equation 5 closely, and it becomes evident that $p_k \log p_k$ represents the average uncertainty carried into the known segment. On the other hand, the latter term $(1-\sum_{i}^{k} p_i)\log\frac{1-\sum_{i}^{k} p_i}{|\mathcal{V}|-k} - (1-\sum_{i}^{k-1} p_i)\log\frac{1-\sum_{i}^{k-1} p_i}{|\mathcal{V}|-k+1}$ reflects the certainty introduced into the unknown portion. It can be shown that the increase in certainty surpasses the introduction of uncertainty as the known part expands.

Corollary C.1.

When sorting the state distribution $X$ predicted by the language model in descending order after applying the softmax function, it follows that $p_k \geq \frac{1-\sum_{i}^{k-1} p_i}{|\mathcal{V}|-k+1}$.

Proof.

By the sorting assumption, $p_1 > p_2 > \dots > p_{|\mathcal{V}|}$ with $\sum_{i}^{|\mathcal{V}|} p_i = 1$. The right-hand side of the corollary is the mean of the remaining probabilities $p_k, \dots, p_{|\mathcal{V}|}$. If $p_k < \frac{1-\sum_{i}^{k-1} p_i}{|\mathcal{V}|-k+1}$, then $p_k$ lies below this mean, which implies the existence of some $j$ within the range $[k+1, |\mathcal{V}|]$ such that $p_j > p_k$. This contradicts the ordering. ∎
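The bound in Corollary C.1 can be checked numerically. The sketch below is illustrative only: the helper names `softmax` and `tail_mean_bound_holds` and the example logits are assumptions, not part of the paper. It tests, for every $k$, whether $p_k$ is at least the mean of the probability mass not yet covered by the first $k-1$ tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def tail_mean_bound_holds(probs, tol=1e-12):
    """Check p_k >= (1 - sum_{i<k} p_i) / (|V| - k + 1) for every k."""
    vocab_size = len(probs)
    for k in range(1, vocab_size + 1):
        remaining = 1.0 - sum(probs[: k - 1])
        if probs[k - 1] + tol < remaining / (vocab_size - k + 1):
            return False
    return True

# Hypothetical logits, sorted in descending order after the softmax.
sorted_probs = sorted(softmax([3.0, 1.0, 0.5, 0.0, -1.0, -2.0]), reverse=True)
print(tail_mean_bound_holds(sorted_probs))

# The bound relies on the ordering: an ascending distribution violates it.
print(tail_mean_bound_holds([0.1, 0.2, 0.7]))
```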

Corollary C.2.

Given that the sum of probabilities for all tokens in the vocabulary is expressed as $\sum_{i}^{|\mathcal{V}|} p_i = 1$, we can deduce that $1-\sum_{i}^{k-1} p_i - p_k = \sum_{k+1}^{|\mathcal{V}|} p_i \geq 0$. Observing this inequality, it is evident that $1-\sum_{i}^{k-1} p_i \geq p_k$.

Corollary C.3.

The function $\text{Conf}_k(X)$ is monotonically increasing, as it is always guaranteed that $\Delta\text{Conf} \geq 0$.
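A small numerical illustration of this monotonicity follows. Equation 5 is not reproduced in this appendix, so the definition of `conf` below is an assumption inferred from the terms quoted in C.1: $\text{Conf}_k(X) = 1 + \bigl(\sum_{i}^{k} p_i \log p_i + (1-\sum_{i}^{k} p_i)\log\frac{1-\sum_{i}^{k} p_i}{|\mathcal{V}|-k}\bigr) / \log|\mathcal{V}|$. The function names and the toy distribution are hypothetical.

```python
import math

def conf(probs, k):
    """Assumed Conf_k: keep the top-k probabilities and spread the
    remaining mass uniformly over the other |V| - k tokens, then
    return 1 - entropy / log|V|."""
    vocab_size = len(probs)
    head = sum(p * math.log(p) for p in probs[:k] if p > 0)
    rest = 1.0 - sum(probs[:k])
    if k < vocab_size and rest > 0:
        tail = rest * math.log(rest / (vocab_size - k))
    else:
        tail = 0.0
    return 1.0 + (head + tail) / math.log(vocab_size)

def conf_increments(probs):
    """Delta Conf for k = 1 .. |V| - 1."""
    return [conf(probs, k) - conf(probs, k - 1) for k in range(1, len(probs))]

# Hypothetical next-token distribution, sorted in descending order.
probs = [0.55, 0.25, 0.10, 0.05, 0.03, 0.02]
for k, delta in enumerate(conf_increments(probs), start=1):
    print(k, delta)
```

Under this assumed definition, $\text{Conf}_0 = 0$ (nothing is known, the distribution is treated as uniform) and each increment measures how much certainty token $k$ adds when it is moved into the known part.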

C.2 Proof of Section 3.3

Proof.

It is intuitively correct that $\mathrm{H}(X) = -\sum_{i}^{|\mathcal{V}|} p_i \log p_i$ converges to $-\sum_{i}^{k} p_i \log p_i$ under the condition $P_\theta(X > k+m \mid X > k) \rightarrow 1$. In accordance with Corollary C.1, the following inequality holds:

$$
\begin{aligned}
\log|\mathcal{V}|\,\Delta\text{Conf} \geq{} & p_k \log\frac{p_k(|\mathcal{V}|-k)}{1-\sum_{i}^{k} p_i} \\
& + \Bigl(1-\sum_{i}^{k-1} p_i\Bigr)\log\frac{1}{p_k} + \Bigl(1-\sum_{i}^{k-1} p_i\Bigr)\log\frac{1-\sum_{i}^{k} p_i}{|\mathcal{V}|-k}.
\end{aligned}
$$

Moreover, based on Corollary C.2, we can further derive:

$$
\begin{aligned}
\log|\mathcal{V}|\,\Delta\text{Conf} \geq{} & p_k \log\frac{p_k(|\mathcal{V}|-k)}{1-\sum_{i}^{k} p_i} \\
& + p_k \log\frac{1-\sum_{i}^{k} p_i}{|\mathcal{V}|-k} + p_k \log\frac{1}{p_k} \geq p_k \log 1 = 0.
\end{aligned}
$$

The detailed proof of Corollary 3.4 is as follows.

Consider

$$
H(X) = -\sum_{i=1}^{k} P_\theta(x_i)\log P_\theta(x_i) - \sum_{k+1}^{|\mathcal{V}|} P_\theta(x_i)\log P_\theta(x_i).
$$

As $P_\theta(X > k+m \mid X > k) \rightarrow 1$,

$$
-\sum_{k+1}^{|\mathcal{V}|} P_\theta(x_i)\log P_\theta(x_i) \rightarrow 0. \tag{7}
$$

We define the binary random variable

$$
B(u) = \begin{cases} 1 & \text{if } X \leq u \\ 0 & \text{if } X > u. \end{cases}
$$

Then, we can express $H(X)$ in terms of $B(u)$ as:

$$
\begin{aligned}
H(X) &= H(X \mid B(u)) + I(X; B(u)) \\
&= H(X \mid B(u)=1)\Pr\{B(u)=1\} + H(X \mid B(u)=0) \times \Pr\{B(u)=0\} + I(X; B(u)) \\
&= H(X(u))\Pr\{B(u)=1\} + H(X \mid B(u)=0) \times \Pr\{B(u)=0\} + I(X; B(u)).
\end{aligned}
$$

As $u \rightarrow k$, we have $\Pr(B(u)=1) \rightarrow 1$, leading to $H(B(u)) \rightarrow 0$. Consequently, $I(X; B(u)) \rightarrow 0$ due to $I(X; B(u)) \leq H(B(u))$.

Additionally, $H(X \mid B(u)=0)\Pr(B(u)=0)$ can be simplified as:

$$
\begin{aligned}
H(X \mid B(u)=0)\Pr\{B(u)=0\} &= -\sum_{i=k+1}^{\infty} P_\theta(x_i)\log\frac{P_\theta(x_i)}{\Pr\{B(u)=0\}} \\
&= -\sum_{i=k+1}^{\infty} P_\theta(x_i)\bigl(\log P_\theta(x_i) - \log\Pr\{B(u)=0\}\bigr) \\
&= -\sum_{i=k+1}^{\infty} P_\theta(x_i)\log P_\theta(x_i) + \Bigl(\sum_{i=k+1}^{\infty} P_\theta(x_i)\Bigr)\log\Pr\{B(u)=0\} \\
&= -\sum_{i=k+1}^{\infty} P_\theta(x_i)\log P_\theta(x_i) + \Pr\{B(u)=0\}\log\Pr\{B(u)=0\}.
\end{aligned}
$$

As $u \rightarrow k$, we have $\Pr(B(u)=0) \rightarrow 0$, so the second term tends to zero, and the first term tends to zero by Equation 7. In conclusion, we obtain $H(X) = H(X(k)) + \epsilon$, where $\epsilon$ is a small number.
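The effect of a vanishing tail can be seen in a toy calculation. The Python sketch below is an illustration under stated assumptions, not the authors' code: it splits the entropy into the head sum over the first $k$ tokens and the tail sum over the rest, and uses a hypothetical distribution (`peaked`) in which $k$ head tokens share mass $1 - \text{tail\_mass}$ and the remaining tokens share `tail_mass` uniformly.

```python
import math

def entropy_terms(probs, k):
    """Split -sum p log p into the head (first k tokens) and tail parts."""
    head = -sum(p * math.log(p) for p in probs[:k] if p > 0)
    tail = -sum(p * math.log(p) for p in probs[k:] if p > 0)
    return head, tail

def peaked(vocab_size, k, tail_mass):
    """Toy distribution: k equal head tokens, uniform tail of given mass."""
    head = [(1 - tail_mass) / k] * k
    tail = [tail_mass / (vocab_size - k)] * (vocab_size - k)
    return head + tail

# As the tail mass shrinks, the tail contribution to H(X) goes to zero,
# so the full entropy is the head sum plus a small remainder.
for tail_mass in (1e-2, 1e-4, 1e-6):
    head, tail = entropy_terms(peaked(1000, 5, tail_mass), 5)
    print(tail_mass, head, tail)
```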

Appendix D Hyperparameter

Choosing hyperparameters is challenging. Table 10 and Table 11 show the parameter sweeps behind our selection. We highlight two observations:

  • The top-$k$ and top-$p$ algorithms enhance diversity by expanding the truncation space, leading to higher MAUVE values. This can be attributed to the ability of LMs to sample certain low-probability tokens whose probabilities are close to those preferred by humans, which brings the two curves in the generation figure closer together. However, the increase in diversity comes with a decrease in coherence.

  • Mirostat manages text coherence effectively by controlling model perplexity, and its parameters are easy to control. However, part of the observed coherence gain may stem from repetition inflating the metric. We also find that Mirostat, when applied to the Llama2-7B model, maintains high coherence while preserving a certain level of diversity.
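To make "expanding the truncation space" concrete, here is a minimal Python sketch of the standard top-$k$ and top-$p$ (nucleus) truncation rules. It is not the evaluation code behind the tables; the function names and the toy distribution are illustrative assumptions.

```python
def top_k_set(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_set(probs, p):
    """Smallest prefix of the sorted tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

# Hypothetical next-token distribution over a five-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k_set(probs, 2), top_k_set(probs, 4))
print(top_p_set(probs, 0.75), top_p_set(probs, 0.85))
```

Raising $k$ or $p$ admits more low-probability tokens into the candidate set, which is the mechanism behind the diversity gains and the coherence losses visible in the sweeps.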

Table 10: Hyperparameter selection for various decoding algorithms with the GPT2-XL model. The row in bold designates the optimal parameter value. We repeat the sampling process five times for each prefix and report the average score.

| Decoding | WritingPrompts | | | | | | WikiText-103 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | rep-2 | rep-3 | rep-4 | Diversity | MAUVE | Coherence | rep-2 | rep-3 | rep-4 | Diversity | MAUVE | Coherence |
| Human | 7.70 | 1.82 | 0.79 | 90.06 | - | 29.57 | 5.14 | 1.02 | 0.33 | 93.66 | - | 63.49 |
| $k=5$ | 32.86 | 24.03 | 19.55 | 49.37 | 23.94 | 53.16 | 19.18 | 10.37 | 6.90 | 70.40 | 82.29 | 62.57 |
| $k=10$ | 19.60 | 11.83 | 8.71 | 69.82 | 33.65 | 50.04 | 11.50 | 4.44 | 2.44 | 83.41 | 82.49 | 62.25 |
| $k=15$ | 14.91 | 8.35 | 6.16 | 76.86 | 44.92 | 49.51 | 9.90 | 3.64 | 2.01 | 86.24 | 88.75 | 61.41 |
| $k=20$ | 12.23 | 6.26 | 4.35 | 81.22 | 48.25 | 47.87 | 8.52 | 2.78 | 1.48 | 88.51 | 88.67 | 61.61 |
| $p=0.80$ | 25.39 | 18.79 | 15.90 | 61.61 | 30.29 | 49.13 | 12.98 | 5.94 | 3.84 | 80.89 | 77.74 | 61.11 |
| $p=0.85$ | 19.78 | 13.14 | 10.80 | 70.02 | 35.10 | 48.20 | 11.23 | 4.87 | 3.02 | 83.73 | 79.61 | 61.46 |
| $p=0.90$ | 15.16 | 9.14 | 7.07 | 76.96 | 38.52 | 47.15 | 9.57 | 3.67 | 2.17 | 86.62 | 84.46 | 61.13 |
| $p=0.95$ | 12.21 | 7.04 | 5.53 | 81.43 | 51.69 | 47.35 | 7.70 | 2.70 | 1.49 | 89.24 | 88.50 | 60.87 |
| mirostat-2.0 | 59.05 | 52.33 | 47.75 | 17.26 | 13.75 | 50.21 | 45.09 | 35.99 | 30.57 | 31.43 | 56.59 | 60.34 |
| mirostat-3.0 | 58.73 | 51.91 | 47.27 | 17.82 | 16.52 | 49.63 | 45.13 | 36.09 | 30.61 | 31.27 | 55.29 | 60.73 |
| mirostat-4.0 | 58.62 | 51.84 | 47.17 | 17.69 | 16.30 | 49.28 | 45.14 | 36.07 | 30.60 | 31.11 | 56.56 | 61.01 |
| mirostat-5.0 | 58.55 | 51.70 | 47.04 | 18.17 | 16.91 | 49.71 | 45.46 | 36.46 | 31.03 | 31.11 | 53.55 | 61.09 |
| $\tau=0.90$ | 15.24 | 9.24 | 7.25 | 76.74 | 38.10 | 46.64 | 9.38 | 3.60 | 2.14 | 86.94 | 85.35 | 61.27 |
| $\tau=0.92$ | 14.02 | 8.09 | 6.27 | 78.65 | 43.92 | 46.63 | 9.08 | 3.35 | 2.06 | 87.47 | 84.99 | 61.00 |
| $\tau=0.95$ | 12.05 | 6.96 | 5.32 | 81.47 | 38.01 | 46.99 | 8.02 | 2.73 | 1.52 | 89.10 | 85.39 | 60.89 |
| $\tau=0.99$ | 9.22 | 4.63 | 3.64 | 85.42 | 40.57 | 46.59 | 6.74 | 1.93 | 0.95 | 91.13 | 90.07 | 59.93 |
| $\eta=0.004$ | 12.33 | 7.08 | 5.66 | 80.95 | 41.86 | 46.69 | 7.83 | 2.63 | 1.48 | 89.22 | 90.40 | 61.10 |
| $\eta=0.002$ | 9.61 | 4.98 | 3.63 | 85.42 | 47.24 | 45.60 | 7.58 | 2.75 | 1.80 | 89.81 | 82.18 | 59.55 |
| $\eta=0.0009$ | 9.10 | 4.58 | 3.41 | 86.58 | 59.34 | 45.67 | 6.45 | 1.75 | 0.98 | 91.38 | 87.17 | 60.33 |
| $\eta=0.0006$ | 8.99 | 4.38 | 3.67 | 85.98 | 46.23 | 44.38 | 6.44 | 1.72 | 0.82 | 91.64 | 86.45 | 60.37 |
| $\epsilon=0.0005$ | 5.46 | 3.16 | 2.75 | 91.38 | 47.19 | 36.38 | 4.82 | 1.55 | 1.00 | 93.81 | 83.72 | 56.58 |
| $\epsilon=0.001$ | 7.07 | 3.85 | 3.11 | 89.32 | 55.19 | 37.72 | 5.22 | 1.51 | 0.93 | 93.00 | 84.54 | 56.39 |
| $\epsilon=0.005$ | 40.69 | 33.57 | 29.95 | 39.80 | 11.02 | 47.51 | 21.10 | 12.50 | 9.10 | 67.63 | 69.27 | 61.30 |
| $\epsilon=0.01$ | 71.14 | 67.75 | 65.28 | 9.81 | 2.89 | 50.71 | 45.17 | 37.82 | 33.96 | 34.02 | 23.08 | 61.04 |

Table 11: Hyperparameter selection for various decoding algorithms with the Llama2-7B model. The row in bold designates the optimal parameter value. We repeat the sampling process five times for each prefix and report the average score.
Decoding WritingPrompts WikiText-103
rep-2 rep-3 rep-4 Diversity MAUVE Coherence rep-2 rep-3 rep-4 Diversity MAUVE Coherence
Human 7.70 1.82 0.79 90.06 - 29.57 5.14 1.02 0.33 93.66 - 63.49
k=5 22.69 14.76 11.10 66.96 39.66 27.15 12.14 5.56 3.20 81.69 63.82 59.43
k=10 14.21 7.26 5.05 79.94 45.34 27.44 9.35 3.70 2.07 86.51 74.15 59.51
k=15 11.00 5.42 3.65 84.46 40.40 25.98 7.88 2.76 1.40 89.11 82.88 59.38
k=20 9.62 4.19 2.94 86.29 47.36 26.25 7.44 2.55 1.40 89.54 87.07 59.27
p=0.80 17.85 10.92 8.23 74.17 46.40 27.23 11.22 5.05 2.94 83.40 75.67 59.33
p=0.85 13.51 7.66 5.58 80.26 44.91 27.55 9.64 4.01 2.29 85.84 79.15 58.94
p=0.90 11.02 5.41 3.91 83.58 55.53 31.11 8.20 3.02 1.58 88.37 80.49 59.54
p=0.95 8.39 4.32 3.26 87.96 58.75 26.99 7.13 2.52 1.32 90.12 82.22 58.89
mirostat-2.0 13.58 7.99 6.04 79.94 44.55 55.91 12.25 6.10 3.94 80.99 81.55 66.51
mirostat-3.0 13.46 7.85 5.88 80.04 50.51 54.83 12.36 6.14 3.99 80.72 82.06 65.87
mirostat-4.0 14.95 8.59 6.85 79.11 42.97 55.10 12.70 6.50 4.32 80.06 81.09 66.31
mirostat-5.0 15.08 9.82 7.88 77.82 45.22 54.84 12.49 6.23 4.06 80.53 78.36 66.46
τ=0.90 10.83 5.36 3.84 84.65 38.07 25.23 8.24 3.24 1.77 88.10 77.49 59.23
τ=0.92 9.37 4.37 3.38 86.86 33.88 27.80 7.68 2.80 1.46 89.17 82.78 58.79
τ=0.95 9.54 5.03 3.78 87.09 40.18 26.80 6.81 2.20 1.06 90.61 87.75 58.83
τ=0.99 8.67 4.53 4.13 88.93 53.26 25.76 6.00 1.98 1.00 91.75 80.68 58.50
η=0.004 7.86 3.42 2.50 88.69 48.93 26.30 7.15 2.47 1.30 90.04 77.47 58.76
η=0.002 7.17 3.58 2.62 89.76 48.34 26.88 6.41 2.21 1.17 91.15 81.49 58.76
η=0.0009 7.10 3.04 2.14 90.42 61.30 26.78 5.68 1.72 0.85 92.41 86.14 58.76
η=0.0006 6.69 2.93 2.42 90.43 63.80 26.35 5.87 1.73 0.76 92.04 88.28 58.75
ϵ=0.0005 7.21 2.84 1.74 89.54 69.00 35.40 5.90 1.78 0.79 92.00 87.18 58.78
ϵ=0.001 8.38 3.54 2.14 87.76 59.34 35.92 6.74 2.16 1.03 90.65 87.81 59.26
ϵ=0.005 19.22 11.44 8.43 70.75 58.23 42.35 10.42 4.61 2.69 84.76 76.91 60.03
ϵ=0.01 35.60 27.20 22.96 47.12 43.69 44.02 14.11 7.34 4.65 78.55 68.76 58.69
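
For readers who want to reproduce the rep-n and Diversity columns, the following is a minimal sketch. It assumes the commonly used definitions, rep-n = 100 × (1 − |unique n-grams| / |total n-grams|) and Diversity = ∏ (1 − rep-n/100) for n = 2, 3, 4, which this appendix does not restate. The whitespace tokenization and the helper names `rep_n`, `diversity`, and `average_over_samples` are illustrative assumptions, not the authors' evaluation code.

```python
from statistics import mean


def rep_n(tokens, n):
    """Percentage of repeated n-grams: 100 * (1 - unique / total)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens: no n-grams, so no repetition.
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))


def diversity(tokens):
    """Product of (1 - rep-n / 100) over n = 2, 3, 4, scaled to 0-100."""
    score = 1.0
    for n in (2, 3, 4):
        score *= 1.0 - rep_n(tokens, n) / 100.0
    return 100.0 * score


def average_over_samples(samples, metric):
    """Average a per-text metric over several sampled continuations.

    Assumption: texts are split on whitespace; a model tokenizer
    could be substituted here.
    """
    return mean(metric(sample.split()) for sample in samples)


if __name__ == "__main__":
    # Five hypothetical continuations sampled for one prefix.
    samples = ["the cat sat on the mat"] * 4 + ["a b a b a b"]
    print(average_over_samples(samples, diversity))
    print(average_over_samples(samples, lambda toks: rep_n(toks, 2)))
```

Because Diversity is computed per text and then averaged, the reported Diversity value in a row need not equal the product formed from that row's averaged rep-n values.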

Appendix E Instruction

In Section 5.5, we use the following instruction, where the sentence is taken from a prompt in WritingPrompts.

<s>[INST] <<SYS>>
You’re a writer. You will write a story given my prefix.
<</SYS>>

Prefix: {sentence} [/INST]

In Section 5.6, we use the following instruction, where the problem is drawn from the GSM8K and MultiArith datasets.

<s>[INST] <<SYS>>You are a help assistant and a math expert.
Please solve the following question and directly return me the answer.<</SYS>>
    Problem: {sentence}
    Let’s think step by step\n[/INST]
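
As an illustration of how the two instructions above might be filled in, here is a minimal sketch. The template strings copy the text shown above; treating `\n` as a newline character, the exact whitespace, and the names `STORY_TEMPLATE`, `MATH_TEMPLATE`, and `build_prompt` are assumptions made for this example rather than the authors' code.

```python
# Story-writing instruction (Section 5.5), copied from the appendix text.
STORY_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "You’re a writer. You will write a story given my prefix.\n"
    "<</SYS>>\n"
    "\n"
    "Prefix: {sentence} [/INST]"
)

# Math instruction (Section 5.6), copied from the appendix text.
# Assumption: the literal "\n" in the appendix denotes a newline.
MATH_TEMPLATE = (
    "<s>[INST] <<SYS>>You are a help assistant and a math expert.\n"
    "Please solve the following question and directly return me the answer.<</SYS>>\n"
    "    Problem: {sentence}\n"
    "    Let’s think step by step\n[/INST]"
)


def build_prompt(template, sentence):
    """Substitute the {sentence} placeholder.

    str.replace is used instead of str.format so that only this one
    placeholder is ever interpreted, whatever else the template contains.
    """
    return template.replace("{sentence}", sentence)


if __name__ == "__main__":
    print(build_prompt(STORY_TEMPLATE, "The last lighthouse keeper heard a knock."))
    print(build_prompt(MATH_TEMPLATE, "Tom has 3 apples and buys 4 more. How many does he have?"))
```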

Appendix F Human Evaluation

Figure 7: Our application for human evaluation.

The human evaluation follows the principles of blind review: the evaluator does not know in advance which decoding algorithm produced each of the two generations, or whether one of them is the human-written text. Our application for human evaluation is illustrated in Figure 7. The evaluator selects the better text according to the following three criteria.

  • Consistency: The extent to which the produced text maintains semantic coherence with the given context.

  • Smoothness: The degree to which the generated text flows seamlessly and is readily comprehensible.

  • Richness: The presence of diverse and captivating content in the generated text.
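
The blind pairwise protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the internals of the application in Figure 7 are not described here, and the function names `make_blind_pair` and `tally`, the "left"/"right" labels, and the data layout are hypothetical.

```python
import random


def make_blind_pair(prefix, text_a, text_b, source_a, source_b, rng):
    """Randomize display order so the evaluator cannot infer the sources.

    Returns what the evaluator sees and a hidden key that maps each
    display position back to its source.
    """
    items = [(text_a, source_a), (text_b, source_b)]
    rng.shuffle(items)
    shown = {"prefix": prefix, "left": items[0][0], "right": items[1][0]}
    key = {"left": items[0][1], "right": items[1][1]}
    return shown, key


def tally(votes, keys):
    """Count wins per source from votes ("left" or "right") and hidden keys."""
    counts = {}
    for vote, key in zip(votes, keys):
        winner = key[vote]
        counts[winner] = counts.get(winner, 0) + 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    shown, key = make_blind_pair(
        "Once upon a time", "continuation one", "continuation two",
        "adaptive decoding", "top-p", rng,
    )
    print(shown)                    # the evaluator sees no source labels
    print(tally(["left"], [key]))   # sources are revealed only after voting
```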

Appendix G Case Study

Given prefixes sampled from WikiText-103, which provide sufficient context, we show the truncated human continuation alongside the text generated by our proposed adaptive decoding. The cases are presented in Table 12 and Table 13.

Table 12: Case study: Text continuation with the GPT2-XL model. Entities are in blue, logical relationships are in red, and coherent content is highlighted in green.

Prefix: Following a period of rest in England, Little turned down a desk assignment and volunteered to return to action on the Western Front, joining Lieutenant Colonel Collishaw’s
Human Continuation: No. 3 Squadron RNAS in March 1918. The unit evolved into No. 203 Squadron of the new Royal Air Force on 1 April, formed after the merger of the RNAS and the Royal Flying Corps. Now ranked captain, and again flying Sopwith Camels, Little gained a further nine successes, beginning with a Fokker Triplane on 1 April, and concluding with two kills in one day on 22 May, an Albatros and a DFW. During this stretch of victories, on 21 April 1918, he was brought down unharmed by Friedrich Ehmann. On 27 May, Little received reports of German Gotha bombers in the vicinity, and took off on a moonlit evening to intercept the raiders. As he closed with one of the bombers, his plane was caught in a searchlight beam and he was struck by a bullet that passed through both his thighs. He crash-landed in a field near Nœux, and bled to death before he was discovered the following morning by a passing gendarme. Little’s skull and ankle had also been fractured in the impact; his body was identified by his friend and fellow ace, Charles Dawson Booker.
Adaptive decoding: 49th Rifles. On his first day back, he led a cavalry charge which was repulsed in the village of Culmore Wood in June 1916, while advancing across an open plain on the company’s way to Vimy Ridge. In late 1916, He had been promoted from sergeant and given a smaller objective position in the near-by village of Meadowhall. His company was soon attacked by a cavalry squad of such ferocity that the archers and machine gunners could have opened up at once. Little was bitten, painfully, by a cavalry lance. He received the nicknames ”Noire” and ”The Snake”, because of his long knife, thin face, and reptilian nature. He later hired a pet donkey to pull his belongings back to camp at the end of each day’s march. In late May 1917, three weeks after the Heed affair, Little was deployed for the first time in front of the Hindenburg Line when, as part of the Anglo-French alliance, his squad was to capture and destroy German flak guns in Belgium. As if on cue, two machine gun crews came through a broken field and attacked Little’s men, who retreated through a canal.
Prefix: The total population size of the African river martin is unknown. In the late 1980s, it appeared to be common, if local, and large numbers were
Human Continuation: seen on migration in Gabon. However, it is particularly poorly studied in the Democratic Republic of the Congo (DRC), and it is not known if there is any relationship between the birds breeding in the DRC and those breeding in coastal areas of Gabon and Congo. Several hundred birds were seen at Conkouati-Douli National Park in Congo in 1996 and a flock of 15,000 birds was seen in Gabon in 1997. Investigations in Gabon in September 2003 extended the known range of this martin. More than 300 birds were found with hundreds of rosy bee-eaters in the Omboué area and the newly established Iguéla National Park, and a mixed flock with rosy bee-eaters at Loango National Park in Gabon was estimated at 100,000 birds in total. In January 2010, 250 martins were seen at a new location in Bakoumba, and single birds were observed in hirundine flocks near Moanda, at Mounana, and at Lekoni.
Adaptive decoding: seen during spring breeding in the Cape province of South Africa. The species is also common in South America, but these differences between species and range regions is unclear. The African river martin is a robust and broad-bodied species that has a dark body and light face, which enhances its camouflage. Its chest is often painted or spotted with black. The coloration and markings can differ between individuals. They have very large ears and a long tail, and in female martins have a pattern that resembles a dorsal stripe, which appears more yellow than brown when united with dark bars, thus giving them the common name of yellow-backed or yellow-fronted. The tail is large, and in males has a faint red pink spot at the base that is very difficult to see unless the animal is under a light. The neck varies from strong to weak, and males often have longer teeth.
Prefix: The water temperature of Briar Creek near its source ranges from 35.96 to 64.94°F (2.20 to 18.30°C),
Human Continuation: with a median of 51.44°F (10.80°C). This is the highest minimum temperature of any studied location in the watershed. Near West Branch Briar Creek’s confluence with Briar Creek, the latter creek has a temperature range of 32 to 67.46°F (0.00 to 19.70°C) and a median of 49.28°F (9.60°C). This is the lowest minimum temperature of any studied location in the drainage basin. Near its mouth, the temperature ranges from 32.18 to 80.96°F (0.10 to 27.20°C). The median temperature is 57.56°F (14.20°C). On West Branch Briar Creek, the temperature is a minimum of 35.6°F (2.0°C), a median of 57.38°F (14.10°C), and a maximum of 71.6°F (22.0°C). Two miles downstream of East Branch Briar Creek’s headwaters, the median temperature is 49.73°F (9.85°C) and the range is 32.18 to 71.42°F (0.10 to 21.90°C).
Adaptive decoding: with a seasonal mean range of 40.52°F (3.68°C). Seasonal variation in temperature, as well as production of sulfides, can significantly affect food source quality, quality of water, and the quality and quantity of sediment that may be trapped in the porous pipes and culverts that lead into groundwater. Relatively constant temperatures between 40°F (4°C) and 55°F (13°C) are common in the watershed. Temperatures above these extremes may occur occasionally when the watershed is dry, but the average annual maximum temperature in Briar Creek is 52.92°F (12.55°C). Average summer temperatures on the creek’s western slopes are generally less than 60°F (15°C) throughout most of the year. Sustaining temperature above 60°F (15°C) is usually limited to the summer months and are relatively infrequent. Briar Creek drains nine watersheds with areas in the northern half of the watersheds containing more than 70% of the plants (wheat, barley, rice, rye and bulgur), but the predominant tree species (huckleberry, aspen, birch, and pine) is oak.
Table 13: Case study: Text continuation with the Llama2-7B model. Entities are in blue, logical relationships are in red, and coherent content is highlighted in green.

Prefix: Liszt found his method through two compositional practices, which he used in his symphonic poems. The first practice was cyclic form, a
Human Continuation: procedure established by Beethoven in which certain movements are not only linked but actually reflect one another’s content. Liszt took Beethoven’s practice one step further, combining separate movements into a single-movement cyclic structure. Many of Liszt’s mature works follow this pattern, of which Les préludes is one of the best-known examples. The second practice was thematic transformation, a type of variation in which one theme is changed, not into a related or subsidiary theme but into something new, separate and independent.
Adaptive decoding: form that repeats the thematic material several times in different musical settings, as with the work of Haydn and Beethoven. It also was used in works such as Wagner’s opera cycle, Der Ring des Nibelungen and Brahms’ Alto Rhapsody. Liszt’s second practice was program music, which uses musical form in order to support programmatic concepts. Liszt applied both compositional practices to his symphonic poems, using cyclic form to construct musical episodes that reflect programmatic ideas. The music reflects the program throughout the piece.
Prefix: Rebellions can occur if the loyalty of a particular province falls too low, with a rebel army appearing in the province to attempt to assume control
Human Continuation: from the owners. Civil wars may also take place if several generals commanding large armies have sufficiently low loyalty. In the event of a civil war, the player is given the choice to back either the current rulers or the rebels. It had been planned to allow other factions who had established a prior claim to the throne by marriage to princesses to join in a civil war to claim the throne for themselves; however, this was never implemented. Naval warfare is carried out upon the campaign map, where ships can be built and organised into fleets. These fleets can be used to control the game’s sea regions and form sea lanes, allowing trade and troop movement between provinces that have constructed a port. Fleets can engage in sea battles with foreign fleets, although unlike land battles these are resolved by the computer. Religion plays an important aspect in Medieval: Total War, with religious differences between the Catholic, Orthodox and Muslim factions affecting diplomacy and population loyalty.
Adaptive decoding: . If the loyalty falls too far, the province will secede, and become a separate, unallied nation (this may sometimes be undesired, so it is important to know how to increase loyalty and to always pay attention to the needs of all provinces). As a province is loyal, its loyalty will be decreased by any war that the province takes part in, and will be increased by having any kind of benefit in the province. As a result of the fall of the West Roman Empire, most of the Western Mediterranean provinces were either conquered or claimed by the Christian Roman States, or by Germanic tribes such as the Visigoths, Ostrogoths or Vandals. The Eastern Mediterranean however is largely in the hands of Arabic Caliphates, or the Empire of Japan. The Western Mediterranean is the primary area where player wars begin, although it is possible to war in the eastern mediterranean, the arctic, or most recently, in the Indian Ocean. The Western Mediterranean however is also the richest area in the game, with a large amount of resources and a large number of wealthy cities which can be attacked.
Prefix: In 1947 Bessie was elected to the Labour Party’s National Executive Committee (NEC). She was generally identified with the left
Human Continuation: wing of the party, and was for a time associated with a grouping known as the ”Socialist Fellowship”, which espoused a programme of colonial freedom, workers’ control and reduced arms expenditure. She resigned from the Fellowship in 1950, along with fellow-MPs Fenner Brockway and Ellis Smith, when it condemned the United Nations intervention in the Korean War. She continued her wholehearted campaigning on behalf of the poorest in the country, pleading with parliament to ”remember the queues outside the Poor Relief offices”, and castigating the ”New Look” fashion of 1948 as wasteful, ”the ridiculous whim of idle people”. Bessie’s fiery reputation did not harm her electorally; in the February 1950 general election, with the Exchange constituency greatly increased by boundary changes, her majority rose to 5,344. Nationally Labour lost 76 seats, and its parliamentary majority was reduced to five. Attlee’s second government was short-lived;
Adaptive decoding: wing of the Labour Party. She was also active in organisations supporting Jewish refugees from the Nazis. With the help of another Jewish member of the Labour Party, Maurice Jacobson, Bessie had her own radio programme (Bessie Braddock’s Book) and also became a popular speaker at meetings. In 1947 she was approached by the National Union of General and Municipal Workers (NUGM) and asked if she would run for parliament in an election. In the general election of 1950 Bessie Braddock was elected as Member of Parliament for Liverpool Exchange. She stood for the National Union of General and Municipal Workers, Labour Co-Operative and the Socialist Medical Association (of which she was a founder member). Her election campaign became more and more left wing. In Parliament she was very outspoken and unpredictable, using colourful language in her speeches. A political opponent once described her as ”the only man in the House of Commons”. In 1954 she was appointed as parliamentary private secretary to Hugh Gaitskell.