Measuring and Controlling Instruction (In)Stability
in Language Model Dialogs

Kenneth Li1,*, Tianle Liu1, Naomi Bashkansky1,
David Bau2, Fernanda Viégas1, Hanspeter Pfister1, Martin Wattenberg1
1Harvard University, 2Northeastern University
*Correspondence to: Kenneth Li <[email protected]>.
Abstract

System-prompting is a standard tool for customizing language-model chatbots, enabling them to follow a specific instruction. An implicit assumption in the use of system prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated instructions for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating instruction stability via self-chats between two instructed chatbots. Testing popular models like LLaMA2-chat-70B and GPT-3.5, we reveal a significant instruction drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and instruction drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines. Code: https://github.com/likenneth/persona_drift.

1 Introduction

Figure 1: An example of instruction drift on gpt-3.5-turbo-16k. Although the chatbot initially follows the system prompt well, it fails when the same question is asked again after an extended conversation. Any LLM user might relate to this issue.

A popular way to control chatbot outputs is to insert a system prompt, a special piece of text, at the beginning of a dialog (Radford et al., 2019). The hope is that the right prompt (e.g., “You are a rockstar programmer who always writes comments”) will customize the language model’s behavior for a particular purpose (e.g., producing clear, correct code). Indeed, Wang et al. (2023) find that asking an LLM to act as an expert can lead it to perform a task better, as if the play-acting causes the LLM to become a genuine expert.

We may view the initial prompt as causing the chatbot to follow a certain instruction, that is, having a specific, coherent behavior. Informally, this may correspond to a specific personality or directly relate to the semantics of the output (as above, for a coding chatbot, a prompt that stipulates it should always write comments). It may also be related to aspects that are orthogonal to the semantics (e.g., a prompt specifying “Always respond with a haiku”).

This paper explores whether chatbots maintain prompted behavior over lengthy dialogs. Anecdotal evidence suggests that instruction stability may “degrade” over the course of a dialog, with chatbot responses straying from what was specified by the prompt. Besides being a potential problem for prompt engineering, the lack of instruction stability also carries significant safety implications. When the chatbot drifts away from its system prompts that stipulate safety aspects, it becomes more susceptible to jailbreaking and more prone to hallucinations.

To measure instruction stability, we introduce a benchmark to quantitatively characterize the phenomenon of instruction drift. Unlike previous work that evaluated instruction following in single-round conversation (question answering) (Ganguli et al., 2022; Skopek et al., 2023; Zhou et al., 2023), our experimental protocol focuses on long-form conversations. We test LLaMA2-chat-70B and find it suffers a significant instruction drift, as shown in Figure 3. This discovery leads us to investigate the cause of the drift and to propose a mitigation method.

A natural guess is that instruction drift relates to the transformer attention mechanism. When a chatbot generates a new token, it takes into account all previous tokens in the dialog but with varying weights. One might speculate that the longer the dialog, the less weight is placed on the initial tokens that make up the prompt. We measure this effect precisely and find that there is indeed a strong attention decay effect. Intuitively, it seems plausible that the prompt’s efficacy will decrease as attention to initial tokens wanes. We back up this intuition mathematically by showing that, in an idealized model, the space of possible outputs from a language model will steadily enlarge over time.

Figure 2: An illustration of the proposed evaluation pipeline of instruction stability. (A) Initially, two language models engage in a conversation: the simulated user LM (red, A), guided by system prompt $s_A$, and the agent LM (purple, B), with system prompt $s_B$. The user LM begins the conversation with a randomly selected starter prompt $a_1$. (B) After the conversation reaches a preset length (8 rounds in our experiment), we test how the agent LM adheres to its system prompt $s_B$. At each turn $i$, we replace the original user message $a_i$ in the conversation history with the probe question $p_B$ and ask the agent LM to generate its answer for a second time. The answer is then judged by the stability measure $f_B(\cdot)$ to compute the stability score.

Finally, given the new understanding of instruction drift, we make a first step towards controlling it. We propose split-softmax, a training-free and parameter-free method that amplifies the model’s attention to the system prompt at inference time. By comparing it with a strong prompting-based baseline and a recent technique from the literature (Sanchez et al., 2023), we demonstrate how split-softmax provides a better trade-off between performance and stability.

This paper presents four contributions. (1) We provide a quantitative benchmark for evaluating instruction drift that does not depend on human annotation or API calls to proprietary LLMs. This reproducible benchmark enables the measurement of progress in controlling instruction drift for both open- and closed-source models (Section 3); (2) We discuss the phenomenon of attention decay and theoretically explain why it may occur (Sections 4 and 4.3); (3) We hypothesize that attention decay is the cause of instruction drift and devise a simple technique called split-softmax as a first step towards controlling it (Section 5.2); (4) Using our benchmark, we show that split-softmax provides a better trade-off between instruction stability and performance compared to two baselines.

2 Related Work

Prompting

Prompting has become the go-to method for adapting language models to downstream use cases. Among the more popular techniques are in-context learning (Min et al., 2022) and chain-of-thought prompting (Wei et al., 2022). Despite being flexible, prompting cannot match the performance of fine-tuning (Mosbach et al., 2023; Lu et al., 2021). For dialog systems based on large language models, a system prompt is placed at the beginning of the context window to define the general behavior of the chatbot. Along the lines of prompting, we test in Section 5 a simple remedy that repeats the system prompt before user utterances.

Instruction Tuning

Instruction tuning has been widely adopted to further align the model to task instructions after pre-training (Gupta et al., 2022; Wei et al., 2021). Given pairs of inputs and outputs that follow the instruction, the model is fine-tuned to generate the desired output. Instruction tuning has played a major role in mitigating instruction drift, especially in addressing safety concerns using RLHF (Ouyang et al., 2022). However, instruction tuning incurs a high cost for collecting training data and is not as flexible as prompting.

Controlled Decoding

System prompt $s_A$: You are very happy! Always respond with lots of joy.
System prompt $s_B$: Always reply in French.
Conversation starter $a_1$: What’s your take on celebrity culture?
Probe question $p_B$: What do you do in London as a tourist?
Stability measure $f_B(\cdot)$: (given as an image in the original; not reproduced here)
Table 1: Examples of required material for our experimental protocol.

Controlled decoding methods can be adapted to avoid instruction drift. Instead of changing the model parameters, these methods modify the inference process to alter the token distribution (Shen et al., 2017; Dathathri et al., 2019; Krause et al., 2020; Li et al., 2023a). For example, for a certain prompt, Todd et al. (2023) find a set of function vectors in the model’s hidden space that can be added to novel prompts to steer the model outputs. This can be thought of as a way to distill the prompt without repeating it in the context window. Weston & Sukhbaatar (2023) propose System-2 attention, where the language model first decides where to attend before producing the final response. Classifier-free guidance (CFG) (Sanchez et al., 2023) works by running the model twice, once with and once without the system prompt, and computing the next-token distribution by a scaled contrast of the two distributions. We evaluate CFG in our experiments in Section 5.

Studies of Instruction Following in Dialog Systems

Li et al. (2023b) and Wu et al. (2023) study the instruction-following capability of large language models under adversarial scenarios. Concurrent to this work, Zhou et al. (2023) use verifiable prompts to evaluate the instruction-following capabilities of language models. However, they all focus on one-turn situations without user input. Zeng et al. (2023) emphasize how difficult it is for language models, even closed-source ones, to evaluate instruction following, motivating us to use deterministic functions for evaluation.

3 Measuring Instruction Drift

We aim to quantify instruction drift without the need for human judgment or API calls of proprietary LLMs. To that end, we introduce a simple experimental protocol, along with a benchmark dataset.

3.1 Experimental Protocol

The idea behind the protocol is straightforward: to measure instruction drift, we create a synthetic dialog between two chatbots $A$ and $B$ and evaluate how far the dialog $[a_1, b_1, a_2, b_2, \ldots]$ drifts from the original prompts. To automate this process, we need four elements: two system prompts $s_A$ and $s_B$, a conversation starter $a_1$, a probe question $p_B$, and a stability measure $f_B(b_i)$. Table 1 shows an example set of these elements.

The protocol consists of the following two steps (Figure 2):

  1. Given the two system prompts, $s_A$ for the user LM and $s_B$ for the agent LM, we pit two copies of the same chatbot against each other, each with its own system prompt. The agent LM is the one under test for instruction stability. We then create a synthetic multi-round dialog between the two chatbot instances by feeding each one’s response to the other. The user LM speaks first with a randomly sampled conversation starter $a_1$. This simulation yields a conversation history $\{(a_i, b_i)\}_{i=1}^{N}$, where $N$ is the total number of rounds. (A “turn” is one utterance like $a_2$; a “round” is when each chatbot takes a turn, like $a_2, b_2$.) We use $N = 8$ in our experiments.

  2. To measure how well the agent LM follows its system prompt during the course of the conversation, in the $i$-th round, the user LM, instead of sending its original utterance $a_i$, asks the predefined probe question $p_B$. Checking the returned answer $b_i'$ with $f_B(\cdot)$, we get a quantitative indication of how well the original system prompt $s_B$ is followed. We call $f_B(b_i' \mid a_i = p_B)$ the instruction stability. The stability measure can be, for example, Python code that calls a library to determine the confidence that a reply is in French.

The result is a quantitative measurement of instruction stability for the agent LM over the course of a single conversation.
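The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the `chat(system_prompt, context)` signature, the function names, and the toy chatbot that "forgets" its instruction once the context grows are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interface: a chatbot maps (system prompt, alternating message
# list ending with the message to answer) to a reply string.
ChatFn = Callable[[str, List[str]], str]


def simulate_dialog(chat: ChatFn, s_a: str, s_b: str, a_1: str,
                    n_rounds: int = 8) -> List[str]:
    """Step 1: self-chat. Returns [a_1, b_1, ..., a_N, b_N]."""
    history = [a_1]
    for i in range(n_rounds):
        history.append(chat(s_b, history))      # agent LM produces b_i
        if i < n_rounds - 1:
            history.append(chat(s_a, history))  # user LM produces a_{i+1}
    return history


def probe_stability(chat: ChatFn, s_b: str, history: List[str], p_b: str,
                    f_b: Callable[[str], float]) -> List[float]:
    """Step 2: in round i, replace a_i by the probe p_B and score the new answer."""
    scores = []
    for i in range(len(history) // 2):
        context = history[: 2 * i] + [p_b]  # a_1, b_1, ..., b_{i-1}, then p_B
        scores.append(f_b(chat(s_b, context)))
    return scores


def toy_chat(system_prompt: str, context: List[str]) -> str:
    """Toy stand-in for a chatbot: answers in upper case only for short contexts."""
    reply = f"reply to: {context[-1]}"
    return reply.upper() if len(context) < 5 else reply


if __name__ == "__main__":
    hist = simulate_dialog(toy_chat, "be happy", "ALWAYS SHOUT", "hi", n_rounds=4)
    print(probe_stability(toy_chat, "ALWAYS SHOUT", hist, "what now?",
                          lambda r: 1.0 if r.isupper() else 0.0))
```

In a real run, `toy_chat` would be replaced by a call to the model under test, and the per-round scores would be averaged over many conversations.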

3.2 Benchmark Dataset

Of course, no single conversation can yield statistically significant results. To assess the degree to which a chatbot is vulnerable to instruction drift, we need to average the results of many conversations. We manually curate a benchmark set of 100 system prompts, categorized into 5 categories: multi-choice responses, character of the agent, answer-string format pattern, memorization of certain facts, and languages the agent speaks. Each system prompt $s_B$ comes with its own probe question $p_B$ and stability measure $f_B(\cdot)$, expressed as a Python function. Each stability measure $f_B(\cdot)$ takes as input the agent LM’s response $b_i$ and deterministically returns a number $p$ in the range $0 \leq p \leq 1$; the larger the value of $p$, the better the system prompt is followed. Table 1 shows one such triplet of system prompt, probe question, and stability measure. We will release the full dataset as well as the conversation starters we use.
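As an illustration of what a deterministic stability measure might look like, the sketch below scores a hypothetical system prompt such as “Always respond in capital letters”. The prompt, the function names, and the averaging helper are assumptions for this example and are not taken from the benchmark itself.

```python
from typing import List


def f_all_caps(response: str) -> float:
    """Hypothetical stability measure: fraction of letters that are upper case.

    Returns a value in [0, 1]; 0.0 if the response contains no letters.
    """
    letters = [c for c in response if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def average_per_round(all_scores: List[List[float]]) -> List[float]:
    """Average stability per round over many conversations.

    all_scores[k][i] is the score of conversation k at round i.
    """
    n = len(all_scores)
    return [sum(column) / n for column in zip(*all_scores)]


if __name__ == "__main__":
    print(f_all_caps("BONJOUR"), f_all_caps("Hello"))
    print(average_per_round([[1.0, 0.0], [1.0, 1.0]]))
```

Because the function is pure and deterministic, repeated evaluation of the same response always yields the same score, which is the property the benchmark relies on.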

3.3 Experimental Results

Figure 3: (A) The phenomenon of instruction drift. As the interaction progresses, not only does the agent LM lose stability to its original system prompt, but it also begins to adopt the instruction of the simulated user LM. The effects were measured on 200 randomly sampled pairs of system prompts on LLaMA2-chat-70B using the procedure shown in Figure 2. The error bar represents one standard deviation. (B) Measuring instruction stability of the agent LM when the user LM’s system prompt is set to an empty string.

We use this protocol and benchmark data to measure instruction drift in LLaMA2-chat-70B and gpt-3.5-turbo-16k (Appendix D). Averaging the instruction stability scores across 200 conversations configured with random pairs of system prompts, we arrive at the blue line in Figure 3A. We observe that the agent LM gradually stops following its system prompt, consistent with everyday usage experience.

As a side experiment, we ask whether the agent LM adopts the user LM’s system prompt. This is plausible since the user LM’s utterances, generated according to $s_A$, have a strong presence in the context window. For this purpose, we swap $a_i$ with the user LM’s probe question $p_A$ and check $f_A(b_i' \mid a_i = p_A)$. Surprisingly, the agent LM gradually adopts the instruction of the user LM over extended rounds of conversation, as shown by the orange line in Figure 3A. This could potentially be exploited by adversarial attacks, raising serious safety concerns.

In another sanity check (Figure 3B), we replace the system prompt of the user LM with an empty string, so that it falls back to the default mode of the underlying language model. This rules out the possibility that the user LM’s system prompt is what produces the significant instruction drift discovered earlier.

Experiment details.

We use LLaMA2-chat-70B for this experiment and follow the format for composing the input sequence from Touvron et al. (2023). Taking the perspective of the agent LM as an example, the input sequence looks like $[s_B, a_1, b_1, \dots, a_{i-1}, b_{i-1}, a_i]$ (omitting formatting tokens like <s>, <<SYS>>, or [INST]), and the model is tasked with generating $b_i$ as a reply to the last utterance from the user LM. Each $s$, $a$, and $b$ here is a string and may contain multiple tokens. Generation is performed with temperature 1.0 and nucleus sampling with $p = 0.9$ (Holtzman et al., 2019).
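For readers unfamiliar with nucleus sampling, the sketch below shows the filtering step on a toy next-token distribution: keep the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, renormalize, and sample. The function names and the toy distribution are assumptions for illustration; the threshold is called `top_p` here to avoid confusion with the stability score $p$. A temperature of 1.0 leaves the model distribution unchanged, so no temperature scaling is shown.

```python
import random
from typing import Dict, List


def nucleus_filter(probs: List[float], top_p: float = 0.9) -> Dict[int, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Draw one token id from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # toy distribution over 4 token ids
    print(nucleus_filter(toy_probs, top_p=0.9))
    print(sample_token(toy_probs, 0.9, random.Random(0)))
```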

4 Attention Decay: a Hypothesis

It is reasonable to hypothesize that instruction drift results from a decaying influence of the prompt over time. To investigate why this happens, we focus on the attention distribution over context tokens in transformer self-attention heads. Although the intuitive hypothesis broadly captures the underlying phenomenon, our empirical and theoretical analyses uncover nuanced discrepancies.

Figure 4: The phenomenon of attention decay demonstrated in the 11th attention head in the 24th layer of LLaMA2-7B, which has a maximum context window size of 4,096 tokens. We generate 12 conversations while tracking the portion of attention allocated to system prompt tokens. The plots are specifically for the agent LM, grouped by the rounds in which the answers are generated; the values are absent for the user LM. We observe sharp drops in attention between turns and rough plateaus within turns.

4.1 Preliminaries

Suppose the input tokens are $\{w_i\}_{i=1}^{t}$, each belonging to the vocabulary $V$. To generate the next token $w_{t+1} \in V$, the current tokens are first embedded into $D$-dimensional vectors $\{h_i^0\}_{i=1}^{t}$ with the embedding matrix $W_e \in \mathbb{R}^{|V| \times D}$. These are then processed sequentially by $L$ transformer layers, resulting in a grid of activations after each layer and for each token, $\{h_i^l\}_{i=1,l=1}^{t,L}$. As the multi-layer perceptron (MLP) and layer norm are context-independent, we leave them out for simplicity. The feed-forward process of the transformer can be summarized as:

$$h_i^l = h_i^{l-1} + \sum_{m=1}^{H} W_o^{l,m}\, \mathrm{Att}^{l,m}(h_1^{l-1}, \ldots, h_i^{l-1}), \tag{1}$$
$$w_{t+1} \sim p(w \mid w_{\leq t}) = \mathrm{softmax}(W_e\, h_t^L). \tag{2}$$

The combination of the $\mathrm{softmax}$ and $W_e$ works as a predictor from $h_t^L$ to the distribution $p(w \mid w_{\leq t})$ of the next token $w_{t+1}$. $\mathrm{Att}^{l,m}$ is the single-head attention operator with output in a lower-dimensional space, and $W_o^{l,m} \in \mathbb{R}^{D \times d}$ maps this output back into $\mathbb{R}^{D}$, the residual stream space.

Crucial to our experiment, we expand the attention operator to show it aggregates activations from previous time steps based on an attention distribution:

$$\alpha_{t,j=1:t}^{l,m} = \mathrm{softmax}\left(\frac{(W_k^{l,m} h_{1:t}^{l-1})^{\top} (W_q^{l,m} h_t^{l-1})}{\sqrt{d}}\right). \tag{3}$$

Then the attention operation is a weighted sum of linearly transformed activations from the last layer:

$$\mathrm{Att}^{l,m}(h_1^{l-1}, \ldots, h_t^{l-1}) = \sum_{j=1}^{t} \alpha_{t,j}^{l,m} \left(W_v^{l,m}\, h_j^{l-1}\right), \tag{4}$$

where $W_v^{l,m} \in \mathbb{R}^{d \times D}$, $W_k^{l,m} \in \mathbb{R}^{d \times D}$, and $W_q^{l,m} \in \mathbb{R}^{d \times D}$ are the value, key, and query weight matrices, respectively.
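The attention distribution and the weighted sum for the last position can be sketched for a single head in plain Python. This is a toy illustration under simplifying assumptions (tiny dimensions, list-based vectors, no positional encoding or masking beyond using only positions up to $t$); the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(w: Matrix, h: Vector) -> Vector:
    """Multiply a d x D matrix by a D-dimensional vector."""
    return [sum(a * b for a, b in zip(row, h)) for row in w]


def attention_head(hs: List[Vector], w_q: Matrix, w_k: Matrix,
                   w_v: Matrix) -> Tuple[Vector, Vector]:
    """Return (alpha, output) for the last position t of one attention head.

    hs holds h_1 .. h_t from the previous layer; alpha is the attention
    distribution over positions 1..t, output is the weighted sum of values.
    """
    d = len(w_q)
    query = matvec(w_q, hs[-1])
    keys = [matvec(w_k, h) for h in hs]
    scores = [sum(k * q for k, q in zip(key, query)) / math.sqrt(d) for key in keys]
    alpha = softmax(scores)
    values = [matvec(w_v, h) for h in hs]
    output = [sum(a * v[c] for a, v in zip(alpha, values)) for c in range(d)]
    return alpha, output


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    alpha, out = attention_head(hs, identity, identity, identity)
    print(alpha, out)
```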

4.2 The Phenomenon of Attention Decay

While generating the next token given an input sequence containing $t$ tokens, in each attention head, the last token computes a normalized attention distribution over all previous tokens (including itself), denoted by $\alpha_{t,i=1:t}$ in Equation 3. Tokens in the system prompt are a special subset of all previous tokens, and we denote the sum of the attention weights allocated to them as $\pi(t) = \sum_{i=1}^{|s_B|} \alpha_{t,i}$. It ranges between 0 and 1 and represents the relative importance of the system prompt throughout the generation process. We monitor this fraction $\pi(t)$ along the decoding time steps $t$ and across turns of conversations in LLaMA2-7B. We only plot $\pi(t)$ from the perspective of the agent LM.

As shown in Figure 4, within each turn, $\pi(t)$ remains almost constant, but there are significant decreases across turns. This observation runs counter to a naive hypothesis of attention decay: if attention were distributed uniformly over previous tokens, $\pi(t)$ would decay hyperbolically and be independent of the number of turns.
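To make the naive baseline concrete, here is a short worked calculation under the assumption of exactly uniform attention. The prompt length of 50 tokens is a hypothetical value chosen only for illustration.

$$
\alpha_{t,i} = \frac{1}{t} \quad \Longrightarrow \quad \pi(t) = \sum_{i=1}^{|s_B|} \frac{1}{t} = \frac{|s_B|}{t}.
$$

With $|s_B| = 50$, this gives $\pi(100) = 0.5$ and $\pi(1000) = 0.05$. The value depends only on the total token count $t$, so under this baseline the curve would fall smoothly within a turn and show no special drop at turn boundaries, unlike the plateaus and drops in Figure 4.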

It is also worth noting that this highlights an issue unique to chatbots, as opposed to plain language models, where out-of-distribution text from interlocutors is absent. The case of a language model completing a partial input sequence is technically equivalent to the agent LM generating answers for a single turn, which displays a plateau in $\pi(t)$.

This observation shows merely the co-occurrence of instruction drift and attention decay. However, it inspires the hypothesis that attention decay may internally contribute to instruction drift, suggesting that addressing the former could help mitigate the latter (Section 5.2).

4.3 A Geometric View of Attention Decay

To shed light on attention decay in Figure 4, both the plateau within an utterance and the drop across utterances, we provide a theoretical explanation in a simplified setting. Liang et al. (2022) show empirically and theoretically that the internal representations of deep neural networks usually lie in a narrow cone in the high-dimensional space. Motivated by their observations, we characterize attention decay from a similar geometric perspective.

We will consider two settings of model generation:

  1. New tokens are generated autoregressively given initial tokens $h_1, \ldots, h_{|s_B|}$, which models the process of the agent LM generating answers;

  2. New tokens are drawn by the user. A user LM could put out-of-distribution tokens into the context window of the agent LM in a potentially adversarial fashion (Zou et al., 2023).

For the first setting, we will show that tokens generated by the model always remain in an approximately low-dimensional convex cone in Theorem A.1. In the second setting, we can characterize the expansion using spherical measure and show that randomly drawn tokens will lead to an expansion of the underlying convex cone with the growth of intrinsic dimension of token embeddings, as shown in Proposition A.3. More details in Appendix A.

5 Mitigating Instruction Drift

If instruction drift is related to attention decay, that suggests we can mitigate drift by manipulating the level of attention on the original prompt. Before presenting an attention-based mitigation method, however, we describe two baselines.

5.1 Baseline Methods

System Prompt Repetition (SPR)

We inject the system prompt with probability $0\leq p\leq 1$ before each user utterance. The repeated system prompts, like the standard system prompt at the start of the input sequence, only appear when the language model is prompted; users do not see them.
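The repetition scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the role/content message format and the function name `build_messages` are hypothetical, and the sketch skips re-injection before the first user utterance because the standard system prompt already precedes it.

```python
import random


def build_messages(system_prompt, turns, p, rng=None):
    """Assemble the agent LM's input with system prompt repetition.

    turns: list of (user_utterance, agent_reply_or_None) pairs.
    p: probability of re-injecting the system prompt before a user utterance.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    rng = rng or random.Random()
    messages = [{"role": "system", "content": system_prompt}]
    for i, (user_text, agent_text) in enumerate(turns):
        # Assumption: the first turn is already preceded by the standard prompt.
        if i > 0 and rng.random() < p:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_text})
        if agent_text is not None:
            messages.append({"role": "assistant", "content": agent_text})
    return messages
```

With `p = 0` this reduces to ordinary system prompting; with `p = 1` the prompt is repeated before every later user utterance, at the cost of extra context-window tokens.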

Classifier-Free Guidance (CFG)

The second method is classifier-free guidance (CFG, Sanchez et al., 2023), which runs the base model twice, first with the system prompt to get $\log p(w|w_{\leq t},s_{B})$ and then without the system prompt to get $\log p(w|w_{\leq t})$. It then uses a contrastive linear operation in logit space to strengthen the effect of the system prompt on answer generation. The new next-token probability distribution is defined by:

$$\log\hat{p}(w|w_{\leq t},s_{B})=\log p(w|w_{\leq t})+\alpha\left(\log p(w|w_{\leq t},s_{B})-\log p(w|w_{\leq t})\right). \tag{5}$$

CFG comes with a hyperparameter $\alpha\geq 1$ that controls how far we shift the predicted logits. When $\alpha=1$, it reduces to prompting with the system prompt; larger $\alpha$ produces a stronger intervention.
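As a rough illustration of Equation 5, the sketch below combines the two sets of next-token logits for a toy vocabulary. The function names are hypothetical, and the final renormalization step is an assumption added so that the result is a proper distribution; Equation 5 itself does not spell it out.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_probs(logits_with_prompt, logits_without_prompt, alpha):
    """Classifier-free guidance on next-token log-probabilities (Equation 5)."""
    if alpha < 1:
        raise ValueError("alpha is expected to be >= 1")
    lp_cond = log_softmax(logits_with_prompt)
    lp_uncond = log_softmax(logits_without_prompt)
    guided = [u + alpha * (c - u) for c, u in zip(lp_cond, lp_uncond)]
    # Assumption: renormalize so the guided scores form a distribution.
    return log_softmax(guided)
```

Setting `alpha = 1` returns the prompted distribution unchanged, matching the reduction noted above; each call needs two forward passes of the base model to obtain the two logit vectors.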

5.2 Proposed Method: Split-softmax (SS)

Motivated by the attention decay phenomenon, we introduce a method that requires no retraining, split-softmax, aimed at reducing this decay with minimal overhead. The basic idea is straightforward: if the problem is that the model pays too little attention to the prompt, then force the model to pay more. In practice, we find that a power-law scaling of attention seems to be effective.

In particular, split-softmax (SS) works by inserting a scaling operation between Equation 3 and Equation 4 for every attention operation. After obtaining the attention distribution $\{\alpha_{t,i}\}_{i=1}^{t}$, which sums to $1$ (omitting superscripts for simplicity), we reweight it by:

$$\pi(t)=\sum_{i=1}^{|s_{B}|}\alpha_{t,i},\quad\alpha^{\prime}_{t,i}=\begin{cases}\frac{\pi^{k}(t)}{\pi(t)}\alpha_{t,i}&\text{if }i\leq|s_{B}|\\ \frac{1-\pi^{k}(t)}{1-\pi(t)}\alpha_{t,i}&\text{if }i>|s_{B}|\end{cases}, \tag{6}$$

where the exponent $0\leq k\leq 1$ is a hyperparameter that controls the strength of our intervention. The smaller $k$ is, the stronger the intervention; when $k=1$, the intervention is nullified. The new set of attention weights $\{\alpha^{\prime}_{t,i}\}_{i=1}^{t}$ also sums to $1$ and replaces $\{\alpha_{t,i}\}_{i=1}^{t}$, so that more attention is paid to the system prompt tokens. Since $0\leq\pi(t)\leq 1$ and $0\leq k\leq 1$ imply $\frac{\pi^{k}(t)}{\pi(t)}\geq 1$, split-softmax increases the proportion of attention paid to the system prompt. See Appendix E for more discussion.
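A minimal sketch of the reweighting in Equation 6, applied to a single attention row, may help. It is written in plain Python for readability rather than as a drop-in patch for any particular model; the function name is hypothetical, and returning the row unchanged when $\pi(t)$ is exactly $0$ or $1$ is an assumption made here to avoid division by zero.

```python
def split_softmax(attn, prompt_len, k):
    """Reweight one attention row (non-negative, summing to 1) as in Equation 6.

    attn: attention weights over positions 1..t.
    prompt_len: number of system prompt tokens |s_B| at the start of the row.
    k: exponent in [0, 1]; smaller k means a stronger intervention.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    pi = sum(attn[:prompt_len])
    if pi <= 0.0 or pi >= 1.0:
        # Assumption: nothing to rebalance in these degenerate cases.
        return list(attn)
    pi_k = pi ** k
    prompt_scale = pi_k / pi               # >= 1, boosts prompt tokens
    rest_scale = (1.0 - pi_k) / (1.0 - pi)  # <= 1, shrinks the rest
    return [
        a * (prompt_scale if i < prompt_len else rest_scale)
        for i, a in enumerate(attn)
    ]
```

For example, with a row `[0.1, 0.1, 0.4, 0.4]`, `prompt_len = 2` and `k = 0.5`, the prompt's share rises from $0.2$ to $\sqrt{0.2}\approx 0.447$ while the row still sums to $1$. In a real model the same operation would be applied per head and per layer after the softmax.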

Refer to caption
Figure 5: Comparing trade-offs between instruction stability and performance. For each of the three methods, we vary a hyperparameter that reflects the strength of the intervention. Each curve plots the effect on stability and performance over the hyperparameter sweep. Compared to two baselines (classifier-free guidance and system prompt repetition), split-softmax produces equal or higher stability for a given level of performance degradation.

5.3 Calibration Using Performance Drop on MMLU

Each method (split-softmax and the two baselines) represents a potentially large intervention; any instruction stabilization may come at the expense of other capabilities of the model. However, each method has a hyperparameter that corresponds to the strength of the intervention. To compare methods, therefore, we need to measure both the increase in instruction stability and the performance drop for various values of the relevant hyperparameter. This is analogous to measuring a precision-recall curve for a classifier.

To measure any performance changes, we use the Massive Multitask Language Understanding benchmark (MMLU, Hendrycks et al., 2020). To compare the different methods, we look at the stability improvement at equal levels of performance drop. Sweeping hyperparameters for each method allows us to measure and plot each method’s stability-performance curve, revealing different trade-offs between our stability metric and MMLU performance.

As expected, we see an inverse relationship between performance and instruction stability for all three methods (Figure 5). This corroborates earlier findings by Gu et al. (2024) that control methods for language models often come at the cost of general capability. The performance drop on MMLU should be thought of as a budget when correcting model behaviors, and two methods should only be compared on stability when their respective hyperparameters cause a similar MMLU performance drop.
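The "equal budget" comparison can be sketched as reading each method's stability off its sweep curve at a common MMLU drop. The snippet below uses linear interpolation, which is an assumption of this sketch rather than a procedure stated in the paper, and the two toy curves are made-up placeholders, not experimental results.

```python
def stability_at_drop(curve, budget):
    """Linearly interpolate stability at a given MMLU drop.

    curve: list of (mmlu_drop, stability) points from a hyperparameter sweep.
    budget: the MMLU drop at which methods are compared.
    """
    pts = sorted(curve)
    if not pts or not pts[0][0] <= budget <= pts[-1][0]:
        raise ValueError("budget lies outside the swept range")
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= budget <= x1:
            if x1 == x0:
                return max(y0, y1)
            w = (budget - x0) / (x1 - x0)
            return y0 + w * (y1 - y0)
    return pts[-1][1]  # single-point curve whose drop equals the budget


# Placeholder curves for two hypothetical methods (NOT values from the paper).
toy_curve_a = [(0.0, 0.40), (1.0, 0.60), (2.0, 0.70)]
toy_curve_b = [(0.0, 0.40), (2.0, 0.60)]
budget = 1.0
better = "a" if stability_at_drop(toy_curve_a, budget) > stability_at_drop(toy_curve_b, budget) else "b"
```

Refusing to extrapolate outside the swept range keeps the comparison honest: a method is only scored at budgets its sweep actually covers.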

To quantify stability, we use a 16-turn conversation as described in Figure 2. We modify these conversations by applying each method to the agent LM. Then we probe the agent LM at each round to test its instruction stability in the same fashion as in Section 3. Stability is measured for individual turns, and the overall stability measure is the average of the stability at each turn of the agent LM. Given the conversation histories of the agent LM under intervention, we sample one and ask questions from MMLU at an intermediate turn (the 4th turn in our experiments); the answers are used to calculate MMLU accuracy. Note that, due to the added system prompt and chat history, the MMLU performance differs from what is reported by the LLaMA team even without intervention (Touvron et al., 2023). However, only the difference between post- and pre-intervention performance is meaningful, as the primary purpose of using MMLU in our case is to calibrate the strength of the intervention.

5.4 Experimental Results

Refer to caption
Figure 6: Comparison of instruction stability across turns, with MMLU performance drop around the value of $0.5$, for system prompt repetition (SPR), classifier-free guidance (CFG), and split-softmax (SS). The whisker represents one standard deviation.

All experiments are conducted on LLaMA2-70B-chat. To save computational cost, we choose one system prompt from each of the five categories, and run experiments over the total twenty ordered pairs of system prompts.

In Figure 5 we plot instruction stability versus performance drop on MMLU as we vary the strength hyperparameter for each method. In general, split-softmax presents a better trade-off between performance drop and instruction stability. It can match performance with system prompt repetition while avoiding using the additional context window. If more drop in performance on MMLU is allowed, split-softmax enables greater instruction stability.

In Figure 6, we break down the instruction stability measurement across turns. Similar to what Sanchez et al. (2023) show, classifier-free guidance helps the model adhere to the system prompt remarkably well for the first round of the conversation, but it does not generalize well into extended conversations. Both system prompt repetition and split-softmax demonstrate higher effectiveness in mitigating instruction drift, though they exhibit different trends. The former excels in regions with a larger number of turns, while the latter performs better at the beginning of the conversation. Note that system prompt repetition consumes a substantial portion of the context window.

6 Conclusions and Future Work

Our experiments indicate that instruction drift is a potentially significant issue for prompt engineering. To help address this challenge, we contribute a new protocol and benchmark to help measure this phenomenon, as well as an idealized mathematical model of its cause. In addition, we proposed a technique, split-softmax, that can help mitigate instruction drift, providing a better stability-performance trade-off than two existing baselines.

There is ample room for future work in this space. For example, it would be natural to explore making changes in architecture or to training to combat instruction drift. Furthermore, all the techniques we discussed involve an apparent trade-off between performance and reliability. Is this a necessary compromise, or are there methods that maintain instruction stability at no cost? It would also be good to deepen our theoretical understanding, adding realism to the idealized “cone” model of instruction drift that we proposed. Finding new ways to measure and prevent instruction drift is an important step in ensuring AI safety and reliability.

Acknowledgments

We thank Jiawei Zhou for useful discussions and feedback on the manuscript.

KL is supported by a fellowship from the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. DB is supported by a grant from Open Philanthropy. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. This work was partially supported by NSF grant IIS-1901030.

References

  • Blumenson (1960) LE Blumenson. A derivation of n-dimensional spherical coordinates. The American Mathematical Monthly, 67(1):63–66, 1960.
  • Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
  • Ganguli et al. (2022) Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  • Gu et al. (2024) Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, and Nanyun Peng. Model editing can hurt general abilities of large language models. arXiv preprint arXiv:2401.04700, 2024.
  • Gupta et al. (2022) Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, and Jeffrey P Bigham. Improving zero and few-shot generalization in dialogue through instruction tuning. arXiv preprint arXiv:2205.12673, 2022.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
  • Krause et al. (2020) Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. Gedi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367, 2020.
  • Li et al. (2023a) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023a.
  • Li (2010) Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010.
  • Li et al. (2023b) Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, and Hongxia Jin. Instruction-following evaluation through verbalizer manipulation. arXiv preprint arXiv:2307.10558, 2023b.
  • Liang et al. (2022) Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  • Lu et al. (2021) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.
  • Min et al. (2022) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
  • Mosbach et al. (2023) Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. arXiv preprint arXiv:2305.16938, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Sanchez et al. (2023) Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. Stay on topic with classifier-free guidance. arXiv preprint arXiv:2306.17806, 2023.
  • Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. Advances in neural information processing systems, 30, 2017.
  • Skopek et al. (2023) Ondrej Skopek, Rahul Aralikatte, Sian Gooding, and Victor Carbune. Towards better evaluation of instruction-following: A case-study in summarization. arXiv preprint arXiv:2310.08394, 2023.
  • Todd et al. (2023) Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. (2023) Shuai Wang, Harrisen Scells, Bevan Koopman, and Guido Zuccon. Can chatgpt write a good boolean query for systematic review literature search? arXiv preprint arXiv:2302.03495, 2023.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Wendel (1962) James G Wendel. A problem in geometric probability. Mathematica Scandinavica, 11(1):109–111, 1962.
  • Weston & Sukhbaatar (2023) Jason Weston and Sainbayar Sukhbaatar. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829, 2023.
  • Wu et al. (2023) Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477, 2023.
  • Zeng et al. (2023) Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641, 2023.
  • Zhou et al. (2023) Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

Appendix A Sketch of the Theory in Section 4.3

A.1 Setting One: Agent Utterances

In linear algebra, a cone is a subset of a vector space that is closed under positive scalar multiplication. In other words, $C$ is a cone if $x\in C$ implies $sx\in C$ for every positive scalar $s$. Moreover, $C$ is called a convex cone if $\alpha x+\beta y\in C$ for any positive scalars $\alpha$ and $\beta$, and any $x,y\in C$.

The dimension of a cone is the dimension of the vector space spanned by the elements of the cone. For convenience, we define two new notions related to low-dimensional cones in the space $\mathbb{R}^{D}$. Given any $d$-dimensional convex cone $C\subset\mathbb{R}^{D}$ ($1\leq d\leq D$), for $\epsilon\in(0,1)$ we define the corresponding $\epsilon$-approximate $d$-dimensional cone as

$$C^{\epsilon}:=\{w\in C\oplus\mathrm{span}(C)^{\bot}\subset\mathbb{R}^{D}:w=u+v\ \text{ for some }u\in C,\ v\in\mathrm{span}(C)^{\bot}\cong\mathbb{R}^{D-d},\ \lVert v\rVert\leq\epsilon\lVert w\rVert\}.$$

Given some $c\in\mathbb{S}^{D-1}$ and $\theta\in(0,\pi/2)$, a $d$-dimensional spherical cone is the set defined by

$$P^{d}[c,\theta]:=\{u\in U\subset\mathbb{R}^{D}:U\cong\mathbb{R}^{d},\ \langle c,u\rangle\geq\lVert u\rVert\cos\theta\}.$$

Theorem A.1.

Assume that the token embeddings of the system prompt, given by $h_{1},\ldots,h_{|s_{B}|}$, lie in the $d$-dimensional approximate cone $C^{\epsilon}$, and that every output-value matrix $W_{ov}^{l,m}=W_{o}^{l,m}W_{v}^{l,m}\in\mathbb{R}^{D\times D}$ satisfies $W_{ov}^{l,m}u\in C^{\epsilon}$ for any $u\in C^{\epsilon}$. Then all subsequent tokens generated by our simplified transformer lie in the convex hull of $C^{\epsilon}$.
In particular, if $C^{\epsilon}$ is contained in some spherical cone $P^{d}[c,\theta]$, then all generated tokens lie in the $\tilde{\epsilon}$-approximate cone $C^{\tilde{\epsilon}}$, where $\tilde{\epsilon}=\epsilon/\sqrt{\epsilon^{2}+\cos^{2}\theta(1-\epsilon^{2})}$.

For the initial tokens, $\theta$ indicates how concentrated their embeddings are, and $d$ is roughly the intrinsic dimension of these embeddings. Note that $d\leq|s_{B}|$, and the number of tokens in the system prompt $|s_{B}|$ is usually much smaller than the dimension of the hidden space $D$, which is $8192$ in the case of LLaMA2-70B-chat. Thus, the assumption that the initial embeddings occupy a low-dimensional cone is reasonable.

Theorem A.1 shows the convex cone for token embeddings remains stable during the generating process if there is no user input, which leads to the plateau within an utterance.
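To get a feel for the constant in Theorem A.1, the following sketch evaluates $\tilde{\epsilon}=\epsilon/\sqrt{\epsilon^{2}+\cos^{2}\theta(1-\epsilon^{2})}$ numerically. The function name is hypothetical; the formula is taken directly from the theorem statement.

```python
import math


def epsilon_tilde(eps: float, theta: float) -> float:
    """Approximation level of the cone containing all generated tokens (Theorem A.1)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    cos_sq = math.cos(theta) ** 2
    return eps / math.sqrt(eps ** 2 + cos_sq * (1.0 - eps ** 2))


# A tightly concentrated prompt (small theta) barely loosens the cone;
# a widely spread prompt (theta near pi/2) loosens it much more.
tight = epsilon_tilde(0.1, 0.1)
wide = epsilon_tilde(0.1, 1.4)
```

In the limit $\theta\to 0$ the expression reduces to $\tilde{\epsilon}=\epsilon$, and as $\theta\to\pi/2$ it approaches $1$, so the guarantee is most informative when the system prompt embeddings are concentrated.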

A.2 Setting Two: User Utterances

Again we assume that the system tokens $h_{1},\ldots,h_{|s_{B}|}$ are from some $C_{0}^{\epsilon}$, and let $C_{n}$ be the smallest convex cone containing $C_{0}$ and the user tokens $\{h_{|s_{B}|+i}\}_{i=1}^{n}$. Then the expansion $C_{0}\subset C_{1}\subset\cdots\subset C_{n}$ reflects the attention decay under the influence of user utterances. To get some intuition on the expanding process, we show the following:

Proposition A.2.

If user tokens are drawn i.i.d. uniformly from $\mathbb{S}^{D-1}$, then with probability $1-\eta$, after $n\geq 4D+2\log\frac{1}{\eta}$ user tokens, $C_{n}$ expands to the whole space $\mathbb{R}^{D}$.
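As a numerical illustration (not the paper's proof), the sketch below computes the token count from the proposition and compares it with the classical formula of Wendel (1962) for the probability that $n$ i.i.d. uniform points on $\mathbb{S}^{D-1}$ all fall in a common half-space, which is the event in which the user tokens alone fail to span $\mathbb{R}^{D}$ as a cone. The function names are hypothetical, and reading $\log$ as the natural logarithm is an assumption.

```python
import math
from fractions import Fraction


def token_threshold(D: int, eta: float) -> int:
    """Smallest integer n with n >= 4D + 2 log(1/eta); natural log assumed."""
    return math.ceil(4 * D + 2 * math.log(1.0 / eta))


def prob_in_half_space(n: int, D: int) -> Fraction:
    """Wendel (1962): P(n uniform points on S^{D-1} lie in some half-space)
    = 2^{-(n-1)} * sum_{k=0}^{D-1} C(n-1, k), for n >= 1."""
    return Fraction(sum(math.comb(n - 1, k) for k in range(D)), 2 ** (n - 1))


# Small D keeps the exact computation cheap; D = 3 is purely illustrative.
n_small = token_threshold(3, 0.01)
failure = prob_in_half_space(n_small, 3)
```

For $D=3$ and $\eta=0.01$ the threshold is $n=22$, and the exact half-space probability at that $n$ is far below $\eta$, consistent with the bound being a sufficient rather than tight condition.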

Proposition A.2 suggests that when user utterances are inserted, the size of the convex cone for token embeddings will grow significantly, which gives rise to the drop of $\pi(t)$ across utterances. To further quantify the expansion of convex cones, we can consider the spherical measure $\sigma_{D-1}$, which is the Borel measure on the $(D-1)$-sphere such that $\sigma_{D-1}(\mathbb{S}^{D-1})=1$. For any $\epsilon$-approximate convex cone $C^{\epsilon}$, define the volume of $C^{\epsilon}$ by

$$\mu(C^{\epsilon}):=\sigma_{D-1}(C^{\epsilon}\cap\mathbb{S}^{D-1}).$$

Then intuitively $\mu(C_{0}^{\epsilon})/\mu(C_{n}^{\epsilon})$ indicates the degree to which the current tokens in $C_{n}^{\epsilon}$ align with the system tokens in $C_{0}^{\epsilon}$, similar to the quantity $\pi(t)$ defined in the previous section.

In real applications, user messages are not i.i.d. uniform variables from $\mathbb{S}^{D-1}$. However, there usually exists an evident proportion of user tokens distinct from the system tokens. They could be tokens specific to the topics that the user inquires about or, more typically, tokens from a new language. It could also happen that the user is attacking the LM by sending adversarial tokens (Zou et al., 2023). The following proposition quantifies how attention decays in terms of $\mu(C_{0}^{\epsilon})/\mu(C_{n}^{\epsilon})$ as the intrinsic dimension of the token embeddings increases.

Proposition A.3.

Suppose $C_{0}$ is a $d_{1}$-dimensional convex cone contained in some $d_{1}$-dimensional spherical cone $P^{d_{1}}[c_{1},\psi_{1}]$, while $C_{n}$ is a $d_{2}$-dimensional convex cone containing a $d_{2}$-dimensional spherical cone $P^{d_{2}}[c_{2},\psi_{2}]$. Then we have

$$\frac{\mu(C_0^{\epsilon})}{\mu(C_n^{\epsilon})} \lesssim \epsilon^{d_2-d_1}.$$

The geometric perspective proposed here gives a concrete explanation of why inserting user prompts causes attention decay, while autoregressive generation from the model does almost no harm. One limitation, however, is that we have only compared the cone structures without tracking the distribution of token embeddings within the cones. In particular, if we force the majority of tokens generated from $C_n^{\epsilon}$ to be contained in or close to $C_0^{\epsilon}$, attention decay could possibly be mitigated, which motivates our split-softmax method.

Appendix B Proofs for Appendix A

We start by simplifying the model and the token-generating process. First, the model is simplified by omitting the MLP and layer norms, as in Equation 1. For the token-generating process, the embedding of the next token $h_{t+1}$ is the one closest to $h_t^L$ among all tokens in the vocabulary, as in Equation 2. Thus, for convenience, we directly set $h_{t+1} := h_t^L / \lVert h_t^L \rVert$ in our simplified model, meaning that all embeddings lie on the unit hypersphere $\mathbb{S}^{D-1} := \{v \in \mathbb{R}^D : \lVert v \rVert = 1\}$.

Proof of Theorem A.1.

Let $\overline{C^{\epsilon}}$ be the convex hull of $C^{\epsilon}$. Then $\overline{C^{\epsilon}}$ is a convex cone containing $C^{\epsilon}$. Theorem A.1 can be proven in two steps.

Step I. We establish that $h_t \in \overline{C^{\epsilon}}$ by induction. $h_1,\ldots,h_{t_0}$ already satisfy the claim by assumption. Supposing that $h_1,\ldots,h_t \in \overline{C^{\epsilon}}$ ($t \geq t_0$), we show that $h_{t+1}$ is also in $\overline{C^{\epsilon}}$. Here we look into $h_j^l$ ($j=1,\ldots,t$, $l=1,\ldots,L$) in the process of generating $h_{t+1}$. We perform induction on $l$. For $l=0$, we have $h_j^l = h_j \in \overline{C^{\epsilon}}$.
Supposing that $h_j^l \in \overline{C^{\epsilon}}$ for $j=1,\ldots,t$, it suffices to prove that $h_j^{l+1} \in \overline{C^{\epsilon}}$.

By the induction hypothesis that $h_j^l \in \overline{C^{\epsilon}}$ ($j=1,\ldots,t$), we can find $k_j \in \mathbb{N}^+$, $x_{j,1},\ldots,x_{j,k_j} \in C^{\epsilon}$, and $w_{j,1},\ldots,w_{j,k_j} > 0$ for $j=1,\ldots,t$ such that

$$h_j^l = \sum_{i=1}^{k_j} w_{j,i}\, x_{j,i}.$$

Thus, by Equation 1 we have

$$
\begin{aligned}
h_j^{l+1} &= h_j^l + \sum_{m=1}^{H} W_o^{l+1,m}\, \mathrm{Att}^{l+1,m}(h_1^l,\ldots,h_j^l) \\
&= h_j^l + \sum_{m=1}^{H}\sum_{s=1}^{j} \alpha_{j,s}^{l+1,m}\, W_o^{l+1,m} W_v^{l+1,m} h_s^l \\
&= h_j^l + \sum_{m=1}^{H}\sum_{s=1}^{j}\sum_{i=1}^{k_s} \alpha_{j,s}^{l+1,m}\, w_{s,i}\, W_o^{l+1,m} W_v^{l+1,m} x_{s,i}.
\end{aligned}
$$

Note that $\alpha_{j,s}^{l+1,m} > 0$ since it is computed by a softmax, and by assumption we have $W_o^{l+1,m} W_v^{l+1,m} x_{s,i} \in C^{\epsilon}$ as $x_{s,i} \in C^{\epsilon}$. Thus, $h_j^{l+1}$ is the sum of an element of $\overline{C^{\epsilon}}$ and positive multiples of elements of $C^{\epsilon}$, and we conclude that $h_j^{l+1} \in \overline{C^{\epsilon}}$. By induction, for $l=1,\ldots,L$ and $j=1,\ldots,t$ we have $h_j^l \in \overline{C^{\epsilon}}$.
Thus, $h_{t+1} = h_t^L / \lVert h_t^L \rVert \in \overline{C^{\epsilon}}$ holds. By induction again, we conclude that $h_t \in \overline{C^{\epsilon}}$ for all $t \geq 1$.
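
The invariance argument of Step I can be sanity-checked numerically. The following is a minimal sketch under assumptions that are not part of the paper: the convex cone is taken to be the nonnegative orthant, a single attention head is used with plain dot-product scores, and a random entrywise nonnegative matrix `W` stands in for $W_o W_v$ so that it maps the cone into itself. The helper names `attention_layer` and `in_cone` are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax; weights are nonnegative (positive in exact arithmetic)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_layer(hs, W):
    """Simplified causal layer: h_j <- h_j + sum_{s<=j} alpha_{j,s} * W h_s (one head)."""
    out = []
    for j, hj in enumerate(hs):
        scores = [sum(a * b for a, b in zip(hj, hs[s])) for s in range(j + 1)]
        alphas = softmax(scores)
        new = list(hj)
        for alpha, h_s in zip(alphas, hs[: j + 1]):
            new = [n + alpha * v for n, v in zip(new, matvec(W, h_s))]
        out.append(new)
    return out

def in_cone(x):
    """Membership in the nonnegative orthant (the assumed convex cone)."""
    return all(v >= 0.0 for v in x)

random.seed(0)
D, T, L = 4, 6, 3  # embedding dimension, sequence length, number of layers
hs = [[random.random() for _ in range(D)] for _ in range(T)]
W = [[random.random() for _ in range(D)] for _ in range(D)]  # nonnegative, preserves the cone
for _ in range(L):
    hs = attention_layer(hs, W)
stays_in_cone = all(in_cone(h) for h in hs)
```

Because every update adds nonnegative multiples of vectors in the cone, the hidden states never leave it, which mirrors the induction on $l$.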

Step II. Let $\gamma = \cos\theta$. We prove that $\overline{C^{\epsilon}} \subset C^{\tilde{\epsilon}}$, where $\tilde{\epsilon} = \epsilon/\sqrt{\epsilon^2 + \gamma^2(1-\epsilon^2)}$. For any $y \in \overline{C^{\epsilon}}$, there exist $k \in \mathbb{N}^+$, $x_1,\ldots,x_k \in C^{\epsilon}$, and $w_1,\ldots,w_k > 0$ such that $y = \sum_{i=1}^{k} w_i x_i$.
By definition of $C^{\epsilon}$, $x_i$ can be written as $x_i = u_i + v_i$, where $u_i \in C$, $v_i \in \mathrm{span}(C)^{\bot}$, and $\lVert v_i \rVert \leq \epsilon \lVert x_i \rVert$. By definition of $P^d[c,\theta]$, we have $\langle c, u_i \rangle \geq \gamma \lVert u_i \rVert$ for all $i=1,\ldots,k$. Let $\tilde{u}_i := \langle c, u_i \rangle c$.
Then $\langle \tilde{u}_i, u_i - \tilde{u}_i \rangle = 0$ and hence $\bigl\langle \sum_{i=1}^{k} w_i \tilde{u}_i, \sum_{i=1}^{k} w_i (u_i - \tilde{u}_i) \bigr\rangle = 0$. Therefore, we have

$$\Bigl\lVert \sum_{i=1}^{k} w_i u_i \Bigr\rVert \geq \Bigl\lVert \sum_{i=1}^{k} w_i \tilde{u}_i \Bigr\rVert = \Bigl\langle c, \sum_{i=1}^{k} w_i u_i \Bigr\rangle \geq \gamma \sum_{i=1}^{k} w_i \lVert u_i \rVert.$$

On the other hand, we know

$$\Bigl\lVert \sum_{i=1}^{k} w_i v_i \Bigr\rVert \leq \sum_{i=1}^{k} w_i \lVert v_i \rVert \leq \frac{\epsilon}{\sqrt{1-\epsilon^2}} \sum_{i=1}^{k} w_i \lVert u_i \rVert.$$

Therefore, it holds that

$$\Bigl\lVert \sum_{i=1}^{k} w_i u_i \Bigr\rVert \geq \frac{\gamma\sqrt{1-\epsilon^2}}{\epsilon} \Bigl\lVert \sum_{i=1}^{k} w_i v_i \Bigr\rVert,$$

which implies that

$$\Bigl\lVert \sum_{i=1}^{k} w_i v_i \Bigr\rVert \leq \frac{\epsilon}{\sqrt{\epsilon^2+\gamma^2(1-\epsilon^2)}} \Bigl\lVert \sum_{i=1}^{k} w_i x_i \Bigr\rVert.$$

Thus, we conclude that $\overline{C^{\epsilon}} \subset C^{\tilde{\epsilon}}$. ∎
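
The bound of Step II can be probed with a small Monte Carlo experiment. The sketch below is illustrative only and rests on assumptions not taken from the paper: the ambient dimension is 5, $C$ is the 2-dimensional spherical cone with axis $e_1$ and half-angle $\theta = \pi/3$ inside $\mathrm{span}(e_1, e_2)$, and $\epsilon = 0.3$. The names `sample_in_c_eps` and `off_cone_ratio` are hypothetical. It draws points $x_i \in C^{\epsilon}$, forms positive combinations $y = \sum_i w_i x_i$, and records the largest observed ratio $\lVert v_y \rVert / \lVert y \rVert$, which should not exceed $\tilde{\epsilon}$.

```python
import math
import random

def sample_in_c_eps(theta, eps, rng):
    """Draw x = u + v with u in a 2-D spherical cone (axis e1, half-angle theta)
    and v orthogonal to span(C), subject to ||v|| <= eps * ||x||."""
    r = rng.uniform(0.1, 2.0)
    phi = rng.uniform(-theta, theta)
    u = [r * math.cos(phi), r * math.sin(phi)]
    # ||v|| <= eps * ||x|| is equivalent to ||v|| <= eps / sqrt(1 - eps^2) * ||u||
    v_norm = rng.uniform(0.0, 1.0) * eps / math.sqrt(1.0 - eps**2) * r
    g = [rng.gauss(0.0, 1.0) for _ in range(3)]
    g_norm = math.sqrt(sum(a * a for a in g))
    v = [v_norm * a / g_norm for a in g]
    return u + v  # list concatenation: coordinates (u1, u2, v1, v2, v3)

def off_cone_ratio(y):
    """||v_y|| / ||y||, where v_y is the part of y orthogonal to span(C)."""
    v_sq = sum(a * a for a in y[2:])
    return math.sqrt(v_sq / sum(a * a for a in y))

rng = random.Random(0)
theta, eps = math.pi / 3, 0.3
gamma = math.cos(theta)
eps_tilde = eps / math.sqrt(eps**2 + gamma**2 * (1.0 - eps**2))

worst = 0.0
for _ in range(2000):
    k = rng.randint(1, 6)
    xs = [sample_in_c_eps(theta, eps, rng) for _ in range(k)]
    ws = [rng.uniform(0.01, 1.0) for _ in range(k)]  # strictly positive weights
    y = [sum(w * x[d] for w, x in zip(ws, xs)) for d in range(5)]
    worst = max(worst, off_cone_ratio(y))
```

Since $\gamma \leq 1$, we always have $\tilde{\epsilon} \geq \epsilon$: taking the convex hull can widen the tolerance, but only by a factor controlled by the cone angle.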

To prove Proposition A.2 we need the following lemma.

Lemma B.1 (Wendel, 1962).

Let $N$ points be scattered uniformly at random on $\mathbb{S}^m \subset \mathbb{R}^{m+1}$. Then the probability that all points lie on some hemisphere is given by

$$a_{m,N} = 2^{-N+1} \sum_{k=0}^{m} \binom{N-1}{k}.$$
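
As a quick illustration of Lemma B.1, the formula can be evaluated exactly with rational arithmetic. This is a minimal sketch; the function name `wendel_probability` is our own and does not come from the paper.

```python
from fractions import Fraction
from math import comb

def wendel_probability(m, N):
    """Probability that N uniform points on S^m (in R^{m+1}) lie on a common hemisphere."""
    return Fraction(sum(comb(N - 1, k) for k in range(m + 1)), 2 ** (N - 1))

p_circle = wendel_probability(1, 3)  # three points on a circle
p_sphere = wendel_probability(2, 4)  # four points on the 2-sphere
p_few = wendel_probability(2, 3)     # N <= m + 1 points always share a hemisphere
```

For fixed $m$ the probability tends to zero as $N$ grows, which is the property used in the proof of Proposition A.2.
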
Proof of Proposition A.2.

If there is no hemisphere containing $h_{t_0+1},\ldots,h_{t_0+n}$, then the origin lies in $C_n$ and is not on the boundary, meaning that $C_n = \mathbb{R}^D$. Thus, we only need to show that for $n \geq 4D + \log\frac{1}{\eta}$, it holds that $a_{D,n} \leq \eta$. Since

$$2^{-n}\sum_{i=0}^{D}\binom{n}{i} \leq 2^{-n}\sum_{i=0}^{D}\frac{n^i}{i!} = 2^{-n}\sum_{i=0}^{D}\frac{D^i}{i!}\Bigl(\frac{n}{D}\Bigr)^{i} \leq 2^{-n}\Bigl(\frac{en}{D}\Bigr)^{D},$$

it suffices to prove that $2^{-n}\bigl(\frac{en}{D}\bigr)^{D} < \eta$. For convenience let $\alpha := 4 + \frac{2}{D}\log\frac{1}{\eta} \leq \frac{n}{D}$. Then we can check that

$$\Bigl(\log 2 - \frac{1}{2}\Bigr)e^{\alpha/2} > \Bigl(\frac{1}{\eta}\Bigr)^{1/D}.$$

Note that

$$e^{\alpha(\log 2 - \frac{1}{2}) - 1} \geq \alpha\Bigl(\log 2 - \frac{1}{2}\Bigr),$$

which is equivalent to

$$e\alpha \leq \frac{e^{\alpha(\log 2 - \frac{1}{2})}}{\log 2 - \frac{1}{2}} = \frac{2^{\alpha}}{e^{\alpha/2}\bigl(\log 2 - \frac{1}{2}\bigr)}.$$

Thus, we have

$$2^{-n}\Bigl(\frac{en}{D}\Bigr)^{D} \leq \frac{(e\alpha)^{D}}{2^{\alpha D}} \leq \frac{1}{\bigl(\log 2 - \frac{1}{2}\bigr)^{D} e^{\alpha D/2}} < \eta.$$
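
The chain of inequalities in this proof can be checked numerically for a few parameter values. The sketch below is only a sanity check, not a substitute for the proof. It assumes $n$ is the smallest integer with $n/D \geq \alpha$, where $\alpha$ is as defined in the proof, and the helper names `binomial_tail`, `envelope`, and `smallest_n` are hypothetical.

```python
from fractions import Fraction
from math import ceil, comb, e, log

def binomial_tail(n, D):
    """Exact value of 2^{-n} * sum_{i=0}^{D} C(n, i)."""
    return Fraction(sum(comb(n, i) for i in range(D + 1)), 2 ** n)

def envelope(n, D):
    """The upper bound 2^{-n} * (e n / D)^D used in the proof."""
    return (e * n / D) ** D / 2 ** n

def smallest_n(D, eta):
    """Smallest integer n with n / D >= alpha, alpha = 4 + (2 / D) * log(1 / eta)."""
    alpha = 4 + (2 / D) * log(1 / eta)
    return ceil(alpha * D)

checks = []
for D in (2, 8, 32):
    for eta in (0.1, 0.01):
        n = smallest_n(D, eta)
        checks.append((D, eta, n, float(binomial_tail(n, D)), envelope(n, D)))

# Each entry should satisfy: binomial tail <= envelope < eta.
all_hold = all(tail <= env < eta for _, eta, _, tail, env in checks)
```

The envelope is typically several times larger than the exact binomial tail, so the threshold on $n$ obtained this way is sufficient but not tight.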

To show Proposition A.3 we need the following lemma.

Lemma B.2 (Li, 2010).

The spherical measure of the spherical cap $P^{m+1}[c,\theta] \cap \mathbb{S}^m$ is given by

$$\sigma_m\bigl(P^{m+1}[c,\theta] \cap \mathbb{S}^m\bigr) = \frac{\int_0^{\theta} \sin^{m-1}x\,dx}{2\int_0^{\pi/2} \sin^{m-1}x\,dx} = \frac{\Gamma(\frac{m+1}{2})}{\sqrt{\pi}\,\Gamma(\frac{m}{2})} \int_0^{\theta} \sin^{m-1}x\,dx,$$

where $\Gamma(x)$ is the Gamma function.
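
The formula of Lemma B.2 is easy to evaluate numerically. The following sketch uses composite Simpson integration from the standard library only; the function name `cap_measure` and the default number of integration steps are our own choices.

```python
import math

def cap_measure(m, theta, steps=2000):
    """Normalized measure of a spherical cap of angular radius theta on S^m,
    using the Gamma-function form of Lemma B.2 and composite Simpson integration."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = theta / steps

    def f(x):
        return math.sin(x) ** (m - 1)

    total = f(0.0) + f(theta)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    integral = total * h / 3
    const = math.gamma((m + 1) / 2) / (math.sqrt(math.pi) * math.gamma(m / 2))
    return const * integral

half_sphere = cap_measure(5, math.pi / 2)  # a hemisphere has measure 1/2
cap_s2 = cap_measure(2, math.pi / 3)       # on S^2 the cap measure is (1 - cos(theta)) / 2
```

For small $\theta$ the integrand behaves like $x^{m-1}$, so the cap measure scales like $\theta^{m}$. This power-law dependence on the dimension is what produces the exponent $d_2 - d_1$ in Proposition A.3.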

Proof of Proposition A.3.

First, we lower bound $\mu(C_n^{\epsilon})$ by identifying as many disjoint spherical caps with angle $\theta := \arcsin\epsilon$ as possible and applying Lemma B.2.

Let M𝑀Mitalic_M be the largest number such that there exists a set of points a1,,aMPd2[c2,ψ2θ]𝕊D1subscript𝑎1subscript𝑎𝑀superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2𝜃superscript𝕊𝐷1a_{1},\ldots,a_{M}\in P^{d_{2}}[c_{2},\psi_{2}-\theta]\cap\mathbb{S}^{D-1}italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ∈ italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_θ ] ∩ blackboard_S start_POSTSUPERSCRIPT italic_D - 1 end_POSTSUPERSCRIPT to ensure PD[ai,θ]Pd2[c2,ψ2]superscript𝑃𝐷subscript𝑎𝑖𝜃superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2P^{D}[a_{i},\theta]\subset P^{d_{2}}[c_{2},\psi_{2}]italic_P start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_θ ] ⊂ italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] (i=1,,M𝑖1𝑀i=1,\ldots,Mitalic_i = 1 , … , italic_M) are disjoint from one another (“disjoint” meaning that the measure of intersection is zero). 
We claim that {Pd2[ai,2θ]}i=1Msuperscriptsubscriptsuperscript𝑃subscript𝑑2subscript𝑎𝑖2𝜃𝑖1𝑀\bigl{\{}P^{d_{2}}[a_{i},2\theta]\bigr{\}}_{i=1}^{M}{ italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_θ ] } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT is a covering of Pd2[c2,ψ2]superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2P^{d_{2}}[c_{2},\psi_{2}]italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ]. Otherwise, choosing a0Pd2[c2,ψ2]𝕊D1iPd2[ai,2θ]subscript𝑎0superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2superscript𝕊𝐷1subscript𝑖superscript𝑃subscript𝑑2subscript𝑎𝑖2𝜃a_{0}\in P^{d_{2}}[c_{2},\psi_{2}]\cap\mathbb{S}^{D-1}\setminus\bigcup_{i}P^{d% _{2}}[a_{i},2\theta]italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] ∩ blackboard_S start_POSTSUPERSCRIPT italic_D - 1 end_POSTSUPERSCRIPT ∖ ⋃ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_θ ] we can check that PD[a0,θ]superscript𝑃𝐷subscript𝑎0𝜃P^{D}[a_{0},\theta]italic_P start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_θ ] does not intersect with any of PD[ai,θ]superscript𝑃𝐷subscript𝑎𝑖𝜃P^{D}[a_{i},\theta]italic_P start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_θ ]. 
Thus, these M+1𝑀1M+1italic_M + 1 spherical caps do not overlap, which contradicts the definition of M𝑀Mitalic_M. Hence Pd2[c2,ψ2]iPd2[ai,2θ]superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2subscript𝑖superscript𝑃subscript𝑑2subscript𝑎𝑖2𝜃P^{d_{2}}[c_{2},\psi_{2}]\subset\bigcup_{i}P^{d_{2}}[a_{i},2\theta]italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] ⊂ ⋃ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_θ ], and by Lemma B.2 we have

Γ(d22)πΓ(d212)0ψ2sind22xdx=σd21(Pd2[c2,ψ2]𝕊D1)Γsubscript𝑑22𝜋Γsubscript𝑑212superscriptsubscript0subscript𝜓2superscriptsubscript𝑑22𝑥𝑑𝑥subscript𝜎subscript𝑑21superscript𝑃subscript𝑑2subscript𝑐2subscript𝜓2superscript𝕊𝐷1\displaystyle\frac{\Gamma(\frac{d_{2}}{2})}{\sqrt{\pi}\Gamma(\frac{d_{2}-1}{2}% )}\int_{0}^{\psi_{2}}\sin^{d_{2}-2}xdx=\sigma_{d_{2}-1}(P^{d_{2}}[c_{2},\psi_{% 2}]\cap\mathbb{S}^{D-1})divide start_ARG roman_Γ ( divide start_ARG italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) end_ARG start_ARG square-root start_ARG italic_π end_ARG roman_Γ ( divide start_ARG italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1 end_ARG start_ARG 2 end_ARG ) end_ARG ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT roman_sin start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 2 end_POSTSUPERSCRIPT italic_x italic_d italic_x = italic_σ start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ψ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] ∩ blackboard_S start_POSTSUPERSCRIPT italic_D - 1 end_POSTSUPERSCRIPT )
\displaystyle\leq i=1Mσd21(Pd2[ai,2θ]𝕊D1)=Mσd21(Pd2[ai,2θ])=MΓ(d22)πΓ(d212)02θsind22xdx.superscriptsubscript𝑖1𝑀subscript𝜎subscript𝑑21superscript𝑃subscript𝑑2subscript𝑎𝑖2𝜃superscript𝕊𝐷1𝑀subscript𝜎subscript𝑑21superscript𝑃subscript𝑑2subscript𝑎𝑖2𝜃𝑀Γsubscript𝑑22𝜋Γsubscript𝑑212superscriptsubscript02𝜃superscriptsubscript𝑑22𝑥𝑑𝑥\displaystyle\sum_{i=1}^{M}\sigma_{d_{2}-1}(P^{d_{2}}[a_{i},2\theta]\cap% \mathbb{S}^{D-1})=M\sigma_{d_{2}-1}(P^{d_{2}}[a_{i},2\theta])=M\frac{\Gamma(% \frac{d_{2}}{2})}{\sqrt{\pi}\Gamma(\frac{d_{2}-1}{2})}\int_{0}^{2\theta}\sin^{% d_{2}-2}xdx.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_θ ] ∩ blackboard_S start_POSTSUPERSCRIPT italic_D - 1 end_POSTSUPERSCRIPT ) = italic_M italic_σ start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1 end_POSTSUBSCRIPT ( italic_P start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT [ italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 2 italic_θ ] ) = italic_M divide start_ARG roman_Γ ( divide start_ARG italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ) end_ARG start_ARG square-root start_ARG italic_π end_ARG roman_Γ ( divide start_ARG italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1 end_ARG start_ARG 2 end_ARG ) end_ARG ∫ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_θ end_POSTSUPERSCRIPT roman_sin start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 2 end_POSTSUPERSCRIPT italic_x italic_d italic_x .

On the other hand, since the caps $P^{D}[a_{i},\theta]$ are disjoint from each other and $P^{D}[a_{i},\theta]\subset P^{D}[c_{2},\psi_{2}]$ (because $\epsilon=\sin\theta$), we know

\[
\begin{aligned}
\mu(C_{n}^{\epsilon})
&\geq\sum_{i=1}^{M}\sigma_{D-1}(P^{D}[a_{i},\theta]\cap\mathbb{S}^{D-1})=M\sigma_{D-1}(P^{D}[a_{i},\theta]\cap\mathbb{S}^{D-1})\\
&=M\frac{\Gamma(\frac{D}{2})}{\sqrt{\pi}\,\Gamma(\frac{D-1}{2})}\int_{0}^{\theta}\sin^{D-2}x\,dx\\
&\geq\frac{\Gamma(\frac{D}{2})}{\sqrt{\pi}\,\Gamma(\frac{D-1}{2})}\frac{\int_{0}^{\psi_{2}}\sin^{d_{2}-2}x\,dx\int_{0}^{\theta}\sin^{D-2}x\,dx}{\int_{0}^{2\theta}\sin^{d_{2}-2}x\,dx}.
\end{aligned}
\]

The last step uses $M\geq\int_{0}^{\psi_{2}}\sin^{d_{2}-2}x\,dx\,\big/\int_{0}^{2\theta}\sin^{d_{2}-2}x\,dx$, which follows from the covering inequality after cancelling the common Gamma-function constant.

Next we upper bound $\mu(C_{0}^{\epsilon})$. For any $(x_{1},\ldots,x_{n})\in\mathbb{B}^{n}:=\{(x_{1},\ldots,x_{n}):\sum_{i=1}^{n}x_{i}^{2}\leq 1\}$, we introduce the hyperspherical coordinate system, which consists of a radial coordinate $r$ and $n-1$ angular coordinates $\phi_{1},\ldots,\phi_{n-1}$, where the angles $\phi_{1},\ldots,\phi_{n-2}$ range over $[0,\pi]$ and $\phi_{n-1}$ ranges over $[0,2\pi)$. Specifically, the coordinates are defined through the transformation:

\[
\begin{aligned}
x_{1}&=r\cos\phi_{1},\\
x_{2}&=r\sin\phi_{1}\cos\phi_{2},\\
x_{3}&=r\sin\phi_{1}\sin\phi_{2}\cos\phi_{3},\\
&\;\;\vdots\\
x_{n-1}&=r\sin\phi_{1}\cdots\sin\phi_{n-2}\cos\phi_{n-1},\\
x_{n}&=r\sin\phi_{1}\cdots\sin\phi_{n-2}\sin\phi_{n-1}.
\end{aligned}
\]

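The transformation above can be sketched in a few lines of Python. The function names are hypothetical and chosen for illustration; the sketch only assumes the coordinate definition as written, with `angles` holding $\phi_{1},\ldots,\phi_{n-1}$ in order.

```python
import math


def hyperspherical_to_cartesian(r, angles):
    """Map (r, phi_1, ..., phi_{n-1}) to Cartesian coordinates (x_1, ..., x_n)."""
    coords = []
    sin_product = r  # running value of r * sin(phi_1) * ... * sin(phi_{k-1})
    for phi in angles:
        coords.append(sin_product * math.cos(phi))
        sin_product *= math.sin(phi)
    coords.append(sin_product)  # x_n carries the full product of sines
    return coords


def sine_product(angles, count):
    """Product sin(phi_1) * ... * sin(phi_count) of the first `count` angles."""
    product = 1.0
    for phi in angles[:count]:
        product *= math.sin(phi)
    return product
```

For $r=1$ the output lies on the unit sphere, and the Euclidean norm of the trailing coordinates $(x_{k+1},\ldots,x_{n})$ equals `sine_product(angles, k)` whenever the first $k$ angles lie in $[0,\pi]$. This is why products of sines are the natural way to express how far a point is from the subspace spanned by the leading coordinates.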
By assumption we know $C_{0}\subset P^{D}[c_{1},\psi_{1}]$. Therefore, using the notion of spherical elements (Blumenson, 1960), we can write

\[
\mu(C_{0}^{\epsilon})=\sigma_{D-1}(C_{0}^{\epsilon}\cap\mathbb{S}^{D-1})=\frac{1}{\mathrm{Area}(\mathbb{S}^{D-1})}\int_{\Omega}\sin^{D-2}\phi_{1}\sin^{D-3}\phi_{2}\cdots\sin\phi_{D-2}\,d(\phi_{1},\ldots,\phi_{D-1}),
\]

where

\[
\Omega=\left\{(\phi_{1},\ldots,\phi_{D-1}):\phi_{1}\in[0,\psi_{1}],\ \phi_{2},\ldots,\phi_{D-2}\in[0,\pi],\ \phi_{D-1}\in[0,2\pi],\ \prod_{j=1}^{d_{1}-1}\sin\phi_{j}\in[0,\epsilon]\right\}.
\]

Denoting

\[
\Omega_{1}=\left\{(\phi_{1},\ldots,\phi_{d_{1}-1}):\phi_{1}\in[0,\psi_{1}],\ \phi_{2},\ldots,\phi_{d_{1}-1}\in[0,\pi],\ \prod_{j=1}^{d_{1}-1}\sin\phi_{j}\in[0,\epsilon]\right\},
\]

then we have

\[
\begin{aligned}
\mu(C_{0}^{\epsilon})
&=\frac{1}{\mathrm{Area}(\mathbb{S}^{D-1})}\int_{(\phi_{1},\ldots,\phi_{d_{1}-1})\in\Omega_{1}}\sin^{D-2}\phi_{1}\cdots\sin^{D-d_{1}}\phi_{d_{1}-1}\,d(\phi_{1},\ldots,\phi_{d_{1}-1})\\
&\qquad\times\int_{0}^{\pi}\cdots\int_{0}^{\pi}\int_{0}^{2\pi}\sin^{D-d_{1}-1}\phi_{d_{1}}\cdots\sin\phi_{D-2}\,d\phi_{d_{1}}\cdots d\phi_{D-1}\\
&=\frac{\mathrm{Area}(\mathbb{S}^{D-d_{1}})}{\mathrm{Area}(\mathbb{S}^{D-1})}\int_{(\phi_{1},\ldots,\phi_{d_{1}-1})\in\Omega_{1}}\sin^{D-2}\phi_{1}\cdots\sin^{D-d_{1}}\phi_{d_{1}-1}\,d(\phi_{1},\ldots,\phi_{d_{1}-1})\\
&\leq\frac{\mathrm{Area}(\mathbb{S}^{D-d_{1}})}{\mathrm{Area}(\mathbb{S}^{D-1})}\epsilon^{D-d_{1}}\int_{0}^{\psi_{1}}\int_{0}^{\pi}\cdots\int_{0}^{\pi}\sin^{d_{1}-2}\phi_{1}\cdots\sin\phi_{d_{1}-2}\,d\phi_{1}\cdots d\phi_{d_{1}-1}\\
&=\frac{\mathrm{Area}(\mathbb{S}^{D-d_{1}})\,\mathrm{Area}(\mathbb{S}^{d_{1}-1})}{2\,\mathrm{Area}(\mathbb{S}^{D-1})}\epsilon^{D-d_{1}}\sigma_{d_{1}-1}(P^{d_{1}}[c_{1},\psi_{1}]\cap\mathbb{S}^{D-1})\\
&=\frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-d_{1}+1}{2})\Gamma(\frac{d_{1}-1}{2})}\epsilon^{D-d_{1}}\int_{0}^{\psi_{1}}\sin^{d_{1}-2}x\,dx.
\end{aligned}
\]

Thus, we conclude that

\[
\begin{aligned}
\frac{\mu(C_{0}^{\epsilon})}{\mu(C_{n}^{\epsilon})}
&\leq\frac{\sqrt{\pi}\,\Gamma(\frac{D-1}{2})}{\Gamma(\frac{D-d_{1}+1}{2})\Gamma(\frac{d_{1}-1}{2})}\frac{\int_{0}^{\psi_{1}}\sin^{d_{1}-2}x\,dx\int_{0}^{2\arcsin\epsilon}\sin^{d_{2}-2}x\,dx}{\int_{0}^{\psi_{2}}\sin^{d_{2}-2}x\,dx\int_{0}^{\arcsin\epsilon}\sin^{D-2}x\,dx}\epsilon^{D-d_{1}}\\
&\lesssim\epsilon^{D-d_{1}}\frac{\epsilon^{d_{2}-1}}{\epsilon^{D-1}}=\epsilon^{d_{2}-d_{1}}.
\end{aligned}
\]

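To illustrate the final scaling, the following Python sketch evaluates only the $\epsilon$-dependent factor of the bound, $\epsilon^{D-d_{1}}\int_{0}^{2\arcsin\epsilon}\sin^{d_{2}-2}x\,dx\,\big/\int_{0}^{\arcsin\epsilon}\sin^{D-2}x\,dx$, and estimates its slope on a log-log scale. The function names, the Simpson-rule integrator, and the dimensions used in the usage note are illustrative assumptions rather than values from the paper.

```python
import math


def sin_power_integral(power, upper, steps=4000):
    """Composite Simpson estimate of the integral of sin(x)**power over [0, upper]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of sub-intervals
    h = upper / steps
    total = math.sin(0.0) ** power + math.sin(upper) ** power
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * math.sin(k * h) ** power
    return total * h / 3


def epsilon_factor(eps, D, d1, d2):
    """Epsilon-dependent part of the bound; constants in D, d1, d2 are omitted."""
    theta = math.asin(eps)  # requires 0 < eps <= 1
    numerator = sin_power_integral(d2 - 2, 2 * theta)
    denominator = sin_power_integral(D - 2, theta)
    return eps ** (D - d1) * numerator / denominator


def log_log_slope(D, d1, d2, eps_small=1e-3, eps_large=1e-2):
    """Finite-difference slope of log(factor) against log(eps)."""
    rise = math.log(epsilon_factor(eps_large, D, d1, d2)) - math.log(
        epsilon_factor(eps_small, D, d1, d2)
    )
    run = math.log(eps_large) - math.log(eps_small)
    return rise / run
```

With the arbitrary choice $D=12$, $d_{1}=3$, $d_{2}=6$, the value of `log_log_slope(12, 3, 6)` should be close to $d_{2}-d_{1}=3$, matching the $\epsilon^{d_{2}-d_{1}}$ rate for small $\epsilon$.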
Figure 7: Histogram of embedding vector norms.

Norm of Embedding Vectors

In Section 4, we assume that the embedding vectors have unit norm. To verify that this assumption is reasonable, we plot the density of the norms of the vocabulary embeddings of LLaMA2-7B-chat in Figure 7. The norms are tightly concentrated around 1.
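
A check of this kind can be sketched as follows. The code below does not load LLaMA2 weights: `synthetic_embeddings` is a hypothetical stand-in that draws Gaussian rows scaled so that their norms concentrate near 1, and all function names are assumptions made for illustration. To reproduce Figure 7, one would pass the model's actual embedding matrix to `row_norms` instead.

```python
import math
import random
import statistics


def row_norms(matrix):
    """Euclidean norm of each row (one row per vocabulary embedding)."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]


def norm_summary(matrix):
    """Summary statistics describing how concentrated the norms are."""
    norms = row_norms(matrix)
    return {
        "mean": statistics.fmean(norms),
        "stdev": statistics.pstdev(norms),
        "min": min(norms),
        "max": max(norms),
    }


def histogram_counts(values, low, high, bins):
    """Counts of values falling in `bins` equal-width bins over [low, high)."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        if low <= v < high:
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


def synthetic_embeddings(rows, dim, seed=0):
    """Hypothetical stand-in data: Gaussian rows with standard deviation 1/sqrt(dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(rows)]
```

Calling `norm_summary(synthetic_embeddings(500, 256))` gives a mean close to 1 with a small standard deviation, and `histogram_counts` provides the binned counts that a density plot would display.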

Appendix C Does RLHF help?

Given how RLHF Ouyang et al. (2022); Ziegler et al. (2019) trains the model, one would expect it to teach the model to pay more attention to the system prompt so as to increase user satisfaction. In Figure 8, we compare LLaMA2-7B with LLaMA2-7B-chat, which is trained on top of the former with human feedback, and show that RLHF can increase the portion of attention paid to the system prompt. This shows that RLHF indeed helps combat instruction drift, but it cannot eradicate the drift entirely, since it only fine-tunes the model.
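
The "portion of attention paid to the system prompt" can be computed from a row of attention weights as sketched below. This is a minimal illustration under two assumptions: the system prompt occupies the first `system_len` key positions, and each row holds the non-negative attention weights of one query token. The function names and the toy values in the usage note are hypothetical.

```python
def system_prompt_attention(attention_row, system_len):
    """Fraction of one query token's attention mass on the first `system_len` keys."""
    total = sum(attention_row)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    return sum(attention_row[:system_len]) / total


def mean_system_prompt_attention(attention_rows, system_len):
    """Average the system-prompt fraction over several rows (e.g. tokens or heads)."""
    fractions = [system_prompt_attention(row, system_len) for row in attention_rows]
    return sum(fractions) / len(fractions)
```

For example, `system_prompt_attention([0.5, 0.25, 0.125, 0.125], 2)` returns `0.75`: three quarters of the attention mass falls on the two system-prompt positions. Comparing this quantity across conversation rounds, before and after RLHF, is the kind of comparison that Figure 8 reports.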

Figure 8: Comparison of attention decay in LLaMA2-7B before and after RLHF training. Unlike Figure 4, which uses a categorical palette to distinguish the round in which the answer is generated, here color depth encodes the round: the deeper the color, the later the round in which the answer is generated.

Appendix D Additional Instruction Drift Experiments

To see how a closed-source model compares with LLaMA2-chat-70B, we test gpt-3.5-turbo-16k on a total of 200 randomly sampled system prompt pairs. Results are shown in Figure 9. We find that gpt-3.5-turbo-16k holds to its system prompt better than LLaMA2-chat-70B, but it still suffers a 10% drop in the stability of its original system prompt.

Figure 9: Measuring the instruction stability of gpt-3.5-turbo-16k via API using the same protocol as Figure 3. On the left, the system prompt is given to the API via the “system” argument; on the right, it is prepended to the user’s first utterance.
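The two conditions in Figure 9 differ only in where the system prompt is placed in the first request. The sketch below builds both variants as lists of role/content dictionaries, the message shape commonly used by chat-style APIs. The function name `build_first_request`, the `mode` values, and the blank-line separator used when prepending are assumptions made for illustration; the exact formatting used in the experiments is not specified here. No API call is made.

```python
def build_first_request(system_prompt, first_user_utterance, mode="system"):
    """Build the opening messages of a dialog in one of two ways.

    mode="system":  the system prompt is sent as its own "system" message.
    mode="prepend": the system prompt is prepended to the user's first
                    utterance and no "system" message is sent.
    """
    if mode == "system":
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": first_user_utterance},
        ]
    if mode == "prepend":
        # ASSUMPTION: a blank line separates the prompt from the utterance.
        combined = system_prompt + "\n\n" + first_user_utterance
        return [{"role": "user", "content": combined}]
    raise ValueError(f"unknown mode: {mode!r}")


prompt = "Always answer in French."
question = "What is the capital of Japan?"

as_system = build_first_request(prompt, question, mode="system")
as_prefix = build_first_request(prompt, question, mode="prepend")
print(as_system)
print(as_prefix)
```

Keeping the rest of the conversation identical across the two variants isolates the effect of prompt placement on stability.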

Appendix E Discussion of Split-softmax Formula

We first quickly show that the post-intervention attention values in Equation 6 still form a distribution, i.e., they sum to 1. Dropping the subscript t:

\begin{align*}
\sum \alpha'_{i} &= \sum_{i \leq |s_{B}|} \alpha'_{i} + \sum_{i > |s_{B}|} \alpha'_{i} \\
&= \frac{\pi^{k}(t)}{\pi(t)} \sum_{i \leq |s_{B}|} \alpha_{i} + \frac{1-\pi^{k}(t)}{1-\pi(t)} \sum_{i > |s_{B}|} \alpha_{i} \\
&= \pi^{k}(t) + \left(1-\pi^{k}(t)\right) \\
&= 1
\end{align*}

Meanwhile, it is worth noting that the ratios between attention scores of tokens within the system prompt, and likewise between those of tokens within the conversation history, remain unchanged, thereby minimizing disruption to the attention mechanism.
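Both properties can be checked numerically on a toy attention row. The sketch below reads the rescaling factors off the derivation above: system-prompt tokens are scaled by π^k/π and the remaining tokens by (1−π^k)/(1−π), where π is the total attention mass on the system prompt. The function name `split_softmax`, the argument names, and the toy values are assumptions chosen for illustration, and the code operates on a single attention row rather than inside a transformer.

```python
def split_softmax(alpha, n_sys, k):
    """Reweight one attention row.

    alpha: attention weights over all tokens (non-negative, summing to 1).
    n_sys: number of leading positions belonging to the system prompt.
    k:     exponent applied to the system-prompt mass pi; with 0 < k < 1,
           pi**k > pi, so the system prompt receives more attention.
    """
    pi = sum(alpha[:n_sys])  # attention mass on the system prompt
    if not 0.0 < pi < 1.0:
        raise ValueError("system-prompt mass must lie strictly in (0, 1)")
    pi_k = pi ** k
    sys_scale = pi_k / pi                    # factor for system-prompt tokens
    rest_scale = (1.0 - pi_k) / (1.0 - pi)   # factor for all other tokens
    return [
        a * (sys_scale if i < n_sys else rest_scale)
        for i, a in enumerate(alpha)
    ]


# Toy row: two system-prompt tokens followed by two conversation tokens.
alpha = [0.05, 0.05, 0.3, 0.6]
new_alpha = split_softmax(alpha, n_sys=2, k=0.5)

print("sum after intervention:", sum(new_alpha))
print("system-prompt mass before/after:", sum(alpha[:2]), sum(new_alpha[:2]))
print("ratio within history before/after:",
      alpha[2] / alpha[3], new_alpha[2] / new_alpha[3])
```

Because each group is multiplied by a single constant, ratios inside a group cannot change; only the split of mass between the two groups moves, from π to π^k.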