3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Xindian Ma1, Wenyuan Liu1, Peng Zhang1, Nan Xu2
1 College of Intelligence and Computing, Tian** University, Tian**, China
2 Bei**g Wenge Technology Co. Ltd.
{xindianma, lwy2020, pzhang}@tju.edu.cn
{xunan2015}@ia.ac.cn
Corresponding Author: Peng Zhang
Abstract

Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding (RoPE), with two major advantages for modeling long contexts: controllable long-term decay and improved position resolution. For controllable long-term decay, 3D-RPE allows for the regulation of long-term decay within the chunk size, ensuring the modeling of relative positional information between tokens at a distant relative position. For enhanced position resolution, 3D-RPE can mitigate the degradation of position resolution caused by position interpolation on RoPE. We have conducted experiments on long-context Natural Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks. From the experimental results, 3D-RPE achieved performance improvements over RoPE, especially in long-context NLU tasks.

Figure 1: 2D Rotary Position Encoding (RoPE) vs. 3D Rotary Position Encoding (3D-RPE).

1 Introduction

Rotary Position Encoding (RoPE) [23] is essential in Transformer-based Large Language Models (LLMs), such as the LLaMA models [24]. RoPE merges the advantages of absolute and relative positional encoding by using a rotation mechanism to represent each position. Despite its widespread use in LLMs [24, 27, 7], RoPE has notable limitations when extending LLMs with a predefined context window. The long-term decay problem of RoPE limits the model’s ability to extend positions outward in long-context tasks. Although the long-context modeling capability of LLMs can be extended through position interpolation, as more positions are inserted, RoPE encounters the challenge of decreased position resolution.

We propose a novel position encoding mechanism for the Transformer architecture, called 3D Rotary Position Encoding (3D-RPE), to address challenges in long-context modeling faced by LLMs using RoPE. Inspired by the Bloch Sphere representation, 3D-RPE applies rotary position encoding on a three-dimensional spherical surface, as illustrated in Figure 1(b). In contrast, RoPE employs rotation on a two-dimensional circular path, as depicted in Figure 1(a). RoPE suffers from long-term decay, as shown in Figure 1(c): as the relative distance increases, the relative upper bound on token correlations continuously decreases. 3D-RPE addresses this issue by segmenting a long sequence into chunks and setting rotation angles within and between the chunks to construct the position encoding. As shown in Figure 1(d), 3D-RPE controls this relative upper bound through two relative positional dimensions, namely within and between chunks. Compared to Figure 1(c), this improves the upper bound on correlations at long relative distances and alleviates long-term decay.

Position Interpolation (PI) methods [4, 18] based on RoPE are often employed to extend LLMs to contexts that exceed the pre-training length. These techniques scale the position encoding during inference so that originally out-of-range positions fall within the trained position interval after interpolation. However, as the interpolation factor increases, PI suffers a substantial decline in positional resolution among tokens, which harms long-context modeling performance. As illustrated in Figure 1(e), extending the pre-training length $L_p$ to $L$ using linear PI [4] reduces positional resolution as $L$ increases. 3D-RPE employs a 3D rotating sphere for position encoding, which supports higher positional resolution than 2D circular rotation. Under the same linear PI extension, 3D-RPE achieves a positional resolution superior to $\frac{L_p}{L}$ (see Figure 1(f)). This benefit is proven theoretically (Theorem 1 in Section 3.2.2) and corroborated by experimental results (Table 4 in Section 5.2).

We conducted experiments on long-sequence Language Modeling (LM) and long-context Natural Language Understanding (NLU) tasks. Our experimental results highlight the promising performance of the 3D-RPE method, especially in tasks requiring long-context language understanding.

The major contributions of this paper are as follows:

  • A position encoding method on a 3D sphere, 3D-RPE, is provided, which can enhance the long-context modeling capability of LLMs by replacing RoPE.

  • We show that 3D-RPE has two benefits: controllable long-term decay and mitigation of the reduction in positional resolution caused by position interpolation.

  • LLMs combined with 3D-RPE achieve significant performance improvements in long-context NLU tasks.

The structure of this paper is as follows. Section 2 covers the preliminaries of 3D-RPE: the Bloch Sphere and RoPE. Section 3 explains the construction of 3D-RPE on a 3D rotating sphere and highlights its benefits over RoPE. Section 4 reviews related work. In Section 5, we validate the advantages of our method through experiments. Section 6 concludes with a discussion of 3D-RPE's impact.

2 Preliminaries

The analysis of 3D-RPE relies on concepts and results from the fields of the Bloch Sphere and RoPE. We introduce the Bloch Sphere in Section 2.1 and RoPE [23] in Section 2.2.

2.1 Bloch Sphere

The Bloch Sphere (BS) offers a geometric depiction of the pure state of a two-level quantum mechanical system. The state vector $|\phi\rangle$ is mathematically expressed as

$$|\phi\rangle = e^{\mathrm{i}\theta}\left(\cos\frac{\varphi}{2}\,|0\rangle + \sin\frac{\varphi}{2}\,e^{\mathrm{i}\theta_1}\,|1\rangle\right) \tag{1}$$

where $|0\rangle$ and $|1\rangle$ are the basis states in Dirac notation, and $\theta$, $\theta_1$, and $\varphi$ are rotation angles. In our work, $\theta$ encodes the relative positions of tokens within chunks, $\varphi$ encodes the relative positions of tokens across chunks, and $\theta_1$ is equal to $0$. Other concepts related to the BS are presented in Supplementary Materials A.
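As a small numerical illustration of Eq. (1), the following sketch builds the two amplitudes of $|\phi\rangle$ and maps them to the standard Bloch-sphere coordinates. The function names `bloch_state` and `bloch_vector` and the sample angles are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def bloch_state(theta, phi, theta1=0.0):
    """Amplitudes (a0, a1) of Eq. (1) in the {|0>, |1>} basis.

    theta1 defaults to 0, matching the choice stated in the text.
    """
    a0 = np.exp(1j * theta) * np.cos(phi / 2)
    a1 = np.exp(1j * theta) * np.sin(phi / 2) * np.exp(1j * theta1)
    return np.array([a0, a1])

def bloch_vector(state):
    """Standard Bloch-sphere coordinates (x, y, z) of a normalized state."""
    a0, a1 = state
    cross = np.conj(a0) * a1
    x = 2 * cross.real
    y = 2 * cross.imag
    z = abs(a0) ** 2 - abs(a1) ** 2
    return np.array([x, y, z])

# Illustrative angles (assumed values, not taken from the paper).
state = bloch_state(theta=0.3, phi=1.1)
norm = np.linalg.norm(state)   # amplitudes of a pure state have unit norm
point = bloch_vector(state)    # lies on the unit sphere
```

With $\theta_1 = 0$ the point is $(\sin\varphi, 0, \cos\varphi)$. The global phase $e^{\mathrm{i}\theta}$ does not move the point on the sphere, which is why 3D-RPE can use $\theta$ and $\varphi$ as two separate positional angles.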

2.2 Rotary Position Embedding

Rotary Position Embedding (RoPE) is a commonly used relative position encoding technique in LLMs such as LLaMA [24], GPT-J [27], and Vicuna [7]. RoPE is a rotary encoding in 2-dimensional space, denoted as follows:

$$RoPE(\bm{h}_m, m) = e^{\mathrm{i}m\theta}\bm{h}_m, \quad RoPE(\bm{h}_n, n) = e^{\mathrm{i}n\theta}\bm{h}_n \tag{2}$$

Here, $\bm{h}_m$ and $\bm{h}_n$ are hidden vectors from the query and key of a specific attention head in the Transformer. To distinguish them, $\bm{h}_m$ and $\bm{h}_n$ are written below as $\bm{q}_m$ and $\bm{k}_n$; $\mathrm{i}$ is the imaginary unit, $\theta$ is the rotary angle in RoPE, and $m$ and $n$ are position indices. The inner product then defines the self-attention score before the softmax:

$$\begin{aligned} s(m-n, \bm{q}_m, \bm{k}_n) &= \langle RoPE(\bm{q}_m, m), RoPE(\bm{k}_n, n) \rangle \\ &= Re\Big[\sum_{l=0}^{d/2-1} \bm{q}_{[2l:2l+1]}\,\bm{k}_{[2l:2l+1]}\,e^{\mathrm{i}(m-n)\theta_l}\Big] \end{aligned} \tag{3}$$

Eq. (3) is a unary function of the relative position $(m-n)$: it represents the relative position between tokens and models the relative positional information. Here, $Re[\cdot]$ denotes the real part of a complex number. In our study, the 3D-RPE self-attention score is a binary function that includes the relative position $(m-n)$.
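The relative-position property of Eq. (3) can be checked numerically. The sketch below is a minimal illustration, not a production implementation. It assumes that dimensions $(2l, 2l+1)$ are paired into one complex number, that $\theta_l = 10000^{-2l/d}$, and that the score is the real part of the Hermitian (conjugate) inner product. The helper names `rope` and `score` are hypothetical.

```python
import numpy as np

def rope(h, pos, base=10000.0):
    """Apply Eq. (2) to a real vector h of even length d.

    Assumption: dimensions (2l, 2l+1) form one complex number,
    and theta_l = base^(-2l/d).
    """
    d = h.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    z = h[0::2] + 1j * h[1::2]
    return np.exp(1j * pos * theta) * z

def score(q, k, m, n):
    """Real part of the Hermitian inner product of the rotated q and k."""
    return np.real(np.sum(rope(q, m) * np.conj(rope(k, n))))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Both pairs of positions have the same offset m - n = 3,
# so the two scores agree up to floating-point error.
s1 = score(q, k, m=5, n=2)
s2 = score(q, k, m=105, n=102)
```

Shifting both positions by the same amount leaves the score unchanged, which is the sense in which Eq. (3) depends only on $(m-n)$.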

3 Method

Section 3.1 introduces the new position encoding on a 3D sphere, 3D-RPE. Section 3.2 focuses on analyzing two benefits of 3D-RPE, namely controllable long-term decay and enhanced position resolution.

Figure 2: Visualization of the 3D Rotary Position Encoding (3D-RPE). The context size is $L$, and the chunk size is $c$. The vectors $[\bm{h}_{j,m}^{1}, \bm{h}_{j,m}^{2}]^{T}$ and $[-\bm{h}_{j,m}^{2}, \bm{h}_{j,m}^{1}]^{T}$ form an orthogonal basis, corresponding to the $|1\rangle$ and $|0\rangle$ states in Eq. (1). The components $\bm{h}_{j,m}^{1}$ and $\bm{h}_{j,m}^{2}$ represent the first and second dimensions of the state vector $\bm{h}_{j,m}$, which is the $m$-th token in the $j$-th chunk.

3.1 3D Rotary Position Encoding

For a long sequence of length $L$ and a chunk size $c$, where $c$ is smaller than the pre-training length of the LLM, the sequence can be divided into $\lceil L/c \rceil$ chunks. Here, $\lceil \cdot \rceil$ represents the ceiling function, rounding up to the nearest integer (see Figure 2). The state vector $\bm{h}_{j,m}$ comes from either the Query or the Key. Here, $j \in [0, \lceil L/c \rceil - 1]$ represents the positional index of the chunk, and $m \in [0, c-1]$ indicates the positional index of the token within the chunk. From $\bm{h}_{j,m}$, the new state vector $\widetilde{\bm{h}}_{j,m}$ is calculated by rotation on the Bloch Sphere. Specifically, two rotation angles, $\theta$ and $\varphi$, are defined, with $\theta$ governing the position encoding of tokens within a chunk and $\varphi$ governing the position encoding between chunks. Our position encoding method is called 3D Rotary Position Encoding, or 3D-RPE. The formal definition of 3D-RPE is provided as follows. The computational process of 3D-RPE in practice is provided in Supplementary Materials B.1.

Definition 1 (3D Rotary Position Encoding).

Let $\bm{h}_{j,m} \in \mathbb{R}^{d}$ be a state vector of an attention head without position encoding, where the dimension $d$ of the vector is an even number. 3D-RPE encodes $\bm{h}_{j,m}$ into the vector $\widetilde{\bm{h}}_{j,m}$, which can be formalized as:

$$\widetilde{\bm{h}}_{j,m} = e^{\mathrm{i}m\theta}\left(\cos\varphi_j\,\bm{h}_{j,m}^{\perp} + \sin\varphi_j\,\bm{h}_{j,m}\right) \tag{4}$$

where $\mathrm{i}$ is the imaginary unit and $\bm{h}_{j,m}^{\perp}$ equals $[-\bm{h}_{j,m}^{2}, \bm{h}_{j,m}^{1}]^{T}$, with $\bm{h}_{j,m}^{1} \in \mathbb{R}^{d/2}$ and $\bm{h}_{j,m}^{2} \in \mathbb{R}^{d/2}$ the first and second halves of the state vector $\bm{h}_{j,m}$.
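Definition 1 can be illustrated with a short sketch. This is one possible reading of Eq. (4), not the authors' implementation (which is described in Supplementary Materials B.1). It assumes that component $l$ of the first half and component $l$ of the second half are combined into one complex number, that $\theta_l = base^{-2l/d}$, and that $\varphi_j = base^{-j}$. The function name `encode_3d_rpe` is hypothetical.

```python
import numpy as np

def encode_3d_rpe(h, pos, chunk_size, base=10000.0):
    """Sketch of Eq. (4) for a single real vector h of even length d.

    Assumptions (not confirmed by the paper): h1[l] + i*h2[l] is the complex
    view of h, theta_l = base^(-2l/d), and varphi_j = base^(-j).
    """
    d = h.shape[0]
    j, m = divmod(pos, chunk_size)        # chunk index j, in-chunk index m
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    phi_j = base ** (-float(j))
    h1, h2 = h[: d // 2], h[d // 2:]
    z = h1 + 1j * h2                      # complex view of h
    z_perp = -h2 + 1j * h1                # complex view of h_perp = [-h2, h1]
    return np.exp(1j * m * theta) * (np.cos(phi_j) * z_perp + np.sin(phi_j) * z)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
c = 4
q_enc = encode_3d_rpe(q, pos=6, chunk_size=c)   # chunk 1, in-chunk index 2
k_enc = encode_3d_rpe(k, pos=5, chunk_size=c)   # chunk 1, in-chunk index 1
same_chunk_score = np.real(np.sum(q_enc * np.conj(k_enc)))
```

Under these assumptions the encoding preserves the norm of `h`. When the query and key share a chunk, the $\varphi$ terms cancel in the inner product and the score reduces to a RoPE-style score in the in-chunk offset.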

In Transformer-based LLMs, after applying position encoding to the state vectors from the Query and Key, it is essential to compute their attention scores. For clarity, we denote the position encoding of the state vector from the Query as 3d-PE$(\bm{q}, i, m)$ and from the Key as 3d-PE$(\bm{k}, j, n)$, where $i$ and $j$ range from $0$ to $\lceil L/c \rceil - 1$, and $m$ and $n$ range from $0$ to $c-1$. The self-attention score can be obtained through the conjugate symmetric inner product of $\bm{q}_{i,m}$ and $\bm{k}_{j,n}$, which are the state vectors from the Query and Key,

$$s(\bm{q}_{i,m}, \bm{k}_{j,n}, \varphi_i - \varphi_j, m-n) = Re\Big[e^{\mathrm{i}(\varphi_i - \varphi_j)} \sum_{l=0}^{d/2-1} e^{\mathrm{i}(m-n)\theta_l}\left(\bm{q}_l \bm{k}_l + \bm{q}_{d/2+l}\,\bm{k}_{d/2+l}\right)\Big] \tag{5}$$

where $l \in [0, \frac{d}{2}-1]$, $\varphi_i = base^{-i}$ and $\varphi_j = base^{-j}$. Let $\{\bm{q}, \bm{k}\}_l$ denote the $l$-th components of $\{\bm{q}, \bm{k}\}$. In experiments using the LLaMA2 models, the $base$ is generally set to $10{,}000$. The self-attention score computed after applying 3d-PE is a function of both the relative position between chunks ($\varphi_i - \varphi_j$) and the relative position within chunks ($m-n$).

Consequently, the self-attention score relying on 3d-PE is influenced by the relative positions at both the chunk and token levels. It is important to highlight that when $\bm{q}_{i,m}$ and $\bm{k}_{j,n}$ reside within the same chunk (i.e., $i=j$), Eq. (5) simplifies to the standard RoPE formulation in Eq. (3). For a detailed derivation and computation process of Eq. (5), as well as the complete formulation of Eq. (4), please refer to Supplementary Materials B.2.
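To make the two-level dependence concrete, the sketch below evaluates Eq. (5) literally for a single query-key pair. It assumes $\theta_l = base^{-2l/d}$ (the schedule quoted in Section 3.2.1) and $\varphi_i = base^{-i}$. The function name `score_3d_rpe` and the random test vectors are hypothetical.

```python
import numpy as np

def score_3d_rpe(q, k, i, m, j, n, base=10000.0):
    """Literal evaluation of Eq. (5).

    Query: chunk i, in-chunk index m. Key: chunk j, in-chunk index n.
    Assumption: theta_l = base^(-2l/d).
    """
    d = q.shape[0]
    half = d // 2
    theta = base ** (-2.0 * np.arange(half) / d)
    phi_i, phi_j = base ** (-float(i)), base ** (-float(j))
    # q_l * k_l + q_{d/2+l} * k_{d/2+l}
    content = q[:half] * k[:half] + q[half:] * k[half:]
    inner = np.sum(np.exp(1j * (m - n) * theta) * content)
    return np.real(np.exp(1j * (phi_i - phi_j)) * inner)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

same_chunk = score_3d_rpe(q, k, i=2, m=3, j=2, n=1)    # chunk phase is exactly 1
cross_chunk = score_3d_rpe(q, k, i=2, m=3, j=0, n=1)   # phase exp(i * (phi_2 - phi_0))
```

For $i=j$ the prefactor $e^{\mathrm{i}(\varphi_i-\varphi_j)}$ equals $1$ and only the in-chunk offset $(m-n)$ matters. For $i \neq j$ the same in-chunk sum is additionally rotated by the chunk-level phase.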

3.2 Benefits of 3D-RPE

In this section, we delve into two benefits offered by 3D-RPE: the ability to control long-term decay and mitigate the reduction in positional resolution caused by position interpolation.

3.2.1 Controllable Long-term Decay

3D-RPE has the property of controllable long-term decay. As with RoPE, taking the absolute value of $s$ in Eq. (5) and applying the Abel transformation, we derive the upper bound of the correlation coefficients related to term dependencies as follows:

$$\begin{aligned} |s(\bm{q}_{i,m}, \bm{k}_{j,n}, \varphi_i - \varphi_j, m-n)| &\leq |e^{\mathrm{i}(\varphi_i - \varphi_j)}|\,\Big|\sum_{l=0}^{d/2-1} E_{l+1}(h_{l+1} - h_l)\Big| \\ &\leq \big(\max_{l} |h_{l+1} - h_l|\big) \sum_{l=0}^{d/2-1} |E_{l+1}| \end{aligned} \tag{6}$$

where $E_l = \sum_{k=0}^{l-1} e^{\mathrm{i}(m-n)\theta_k}$ and $E_0 = 0$. For RoPE [23], the relative upper bound $E_{rope}$ is given by $\frac{1}{d/2}\sum_{j=1}^{d/2}|S_j|$, where $S_j = \sum_{t=0}^{j-1} e^{\mathrm{i}(m-n)\theta_t}$ (see Section 3.4.3 of RoPE [23]). By setting $\theta_t = 10000^{\frac{-2t}{d}}$, the value decays as the relative position $(m-n)$ increases. The upper bound $E_{3d-rpe}$ of 3D-RPE is formalized as follows:

$$E_{3d-rpe} = \frac{1}{d/2}\sum_{l=1}^{d/2}|E_l| \tag{7}$$

The domains of the relative position $(m-n)$ differ between $E_{3d-rpe}$ and $E_{rope}$. In $E_{rope}$, $(m-n)$ is in the range $[0, L-1]$, while in $E_{3d-rpe}$, it is in $[0, c-1]$. The relative positions between tokens exceeding the chunk size $c$ are constructed collaboratively using positional encoding within and across chunks. The Relative Position Matrix $\bm{A}$ using 3D-RPE is shown in Figure 3.

To illustrate the advantage of controllable long-term decay, we present the results in Figure 1(c) and Figure 1(d). As shown in Figure 1(c), when the relative position $(m-n)$ exceeds approximately $1000$, $E_{rope}$ begins to decrease significantly, falling below $5$. This limitation of $E_{rope} \leq 5$ poses challenges for RoPE in modeling attention scores between tokens with longer relative distances (greater than $4000$). In contrast, as shown in Figure 1(d), 3D-RPE employs both $(m-n)$ and $(\varphi_i - \varphi_j)$, setting $c = 1000$ to keep $(m-n)$ within $1000$, thereby preventing decay over longer distances. This method ensures $E_{3d-rpe}$ stays at or above $5$ for all relative positions.
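The two bounds discussed above can be evaluated numerically. The sketch below computes $\frac{1}{d/2}\sum_j |S_j|$ for a given relative position and then restricts it to the in-chunk offset, as Eq. (7) does. The head dimension `d=128` and the helper names `relative_upper_bound` and `chunked_bound` are assumptions made for this illustration. The chunk-level factor $|e^{\mathrm{i}(\varphi_i-\varphi_j)}| = 1$ does not change the bound and is therefore omitted.

```python
import numpy as np

def relative_upper_bound(rel_pos, d=128, base=10000.0):
    """(1 / (d/2)) * sum_{j=1}^{d/2} |S_j|, S_j = sum_{t<j} exp(i * rel_pos * theta_t).

    Assumption: theta_t = base^(-2t/d), with d = 128 as an illustrative head size.
    """
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    partial_sums = np.cumsum(np.exp(1j * rel_pos * theta))   # S_1, ..., S_{d/2}
    return np.mean(np.abs(partial_sums))

def chunked_bound(pos_q, pos_k, chunk_size, d=128, base=10000.0):
    """Same bound evaluated at the in-chunk offset |m - n| only, as in Eq. (7)."""
    m, n = pos_q % chunk_size, pos_k % chunk_size
    return relative_upper_bound(abs(m - n), d, base)

# Bound as a function of the plain relative distance (RoPE setting).
rope_curve = [relative_upper_bound(r) for r in (0, 1000, 4000)]

# Two tokens 4000 positions apart that share the same in-chunk index:
# the chunked bound is evaluated at offset 0 instead of 4000.
far_pair = chunked_bound(pos_q=4500, pos_k=500, chunk_size=1000)
```

At offset $0$ every partial sum satisfies $|S_j| = j$, so the bound equals $(d/2+1)/2$, and it can never exceed that value at any other offset. Chunking bounds the offset by $c-1$, so the RoPE-style decay is only ever evaluated on $[0, c-1]$.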

Figure 3: Visualization of the Relative Position Matrix $\bm{A}$ employing 3D-RPE, with chunk size $c=4$ and sequence size $L=12$. Each matrix element $A_{i,j}$ represents the relative position between the $i$-th query vector $\bm{q}$ and the $j$-th key vector $\bm{k}$.
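The two-level relative positions can be tabulated for the setting of Figure 3 ($c=4$, $L=12$). The sketch below only shows how a flat position splits into a chunk index and an in-chunk index; it does not attempt to reproduce the figure's exact layout. The helper name `relative_position_tables` is hypothetical.

```python
import numpy as np

def relative_position_tables(seq_len, chunk_size):
    """Return (chunk_diff, offset_diff) tables for all query/key position pairs.

    chunk_diff[a, b]  = chunk index of query a minus chunk index of key b.
    offset_diff[a, b] = in-chunk index of query a minus in-chunk index of key b.
    """
    pos = np.arange(seq_len)
    chunk, offset = np.divmod(pos, chunk_size)
    chunk_diff = chunk[:, None] - chunk[None, :]
    offset_diff = offset[:, None] - offset[None, :]
    return chunk_diff, offset_diff

chunk_diff, offset_diff = relative_position_tables(seq_len=12, chunk_size=4)
```

The flat relative position is recovered as `chunk_size * chunk_diff + offset_diff`, while the in-chunk offset never exceeds $c-1$ in magnitude.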

3.2.2 Enhanced Positional Resolution

Position Interpolation (PI) [4] has been introduced to scale down the position indices to align with the original window size, resulting in enhanced outcomes for context extension. However, as the extension length and interpolation increase, PI can lead to a reduction in relative positional resolution. 3D-RPE can be used alongside PI for long-context extensions. Compared to RoPE combined with PI, 3D-RPE has the advantage of mitigating the reduction in positional resolution caused by positional interpolation, as demonstrated in Theorem 1.

Theorem 1 (Enhanced Position Resolution).

For a pre-trained language model with a pre-training length of $L_p$ and an extension length requirement of $L$, employing a linear position interpolation extension method $\mathcal{I}$ based on Rotary Position Encoding (RoPE) changes the relative positional resolution from $\mathcal{E}_{rope}$ to $\mathcal{E}_{rope}^{\prime}$. Let $\mathcal{E}_{3d-rpe}^{\prime}$ denote the relative positional encoding resolution achieved by the method $\mathcal{I}$ based on 3D-RPE. Then, with chunk size $c \geq 3$, we have:

$$\mathcal{E}_{3d-rpe}^{\prime} > \mathcal{E}_{rope}^{\prime} \tag{8}$$

The proof of Theorem 1 is provided in Supplementary Materials C.

We also validate this benefit empirically in a training-free setting. Methods combining RoPE with interpolation show a significant increase in perplexity as the modeling length increases in language modeling tasks. In contrast, the increase in perplexity is substantially smaller when employing 3D-RPE with linear interpolation (see Table 4 in Section 5). This indicates that the enhanced positional resolution improves the performance of long-sequence language modeling.
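The resolution loss that linear position interpolation causes for RoPE can be seen in a short sketch. It assumes the usual linear PI rule of rescaling each position index by $L_p / L$; the function name `linear_pi_positions` and the lengths used are illustrative choices, not values from the experiments.

```python
import numpy as np

def linear_pi_positions(seq_len, pretrain_len):
    """Rescale position indices so that they fit inside the pre-training window.

    Returns the interpolated positions and the spacing between neighbours,
    which is L_p / L when seq_len > pretrain_len.
    """
    scale = min(1.0, pretrain_len / seq_len)
    return np.arange(seq_len) * scale, scale

# Extending an assumed 4k pre-training window to 16k tokens.
positions, resolution = linear_pi_positions(seq_len=16384, pretrain_len=4096)
```

Here neighbouring tokens end up `0.25` apart instead of `1`, which is the $\frac{L_p}{L}$ resolution that Theorem 1 takes as the RoPE baseline.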

4 Related Work

This section provides an overview of the extensive literature on position encoding in Transformers [26] and discusses context extending capabilities based on RoPE.

Position Encoding (PE): PE is important for Transformer-based language models. Earlier studies [22, 21, 28, 23] have focused on enhancing the original absolute position encoding to develop better relative position encoding, thereby improving the text modeling capabilities of language models. These works [22, 21, 28] utilized trainable position vector encoding to directly incorporate positional information into context representations. Although effective, these methods typically add positional information to contextual representations, making them unsuitable for linear self-attention architectures. RoFormer [23] introduced relative position information by rotating context representations, known as RoPE. Transformers utilizing RoPE have become a prevalent backbone in various LLM designs [24, 8, 27, 16]. Our proposed 3D-RPE differs from the two-dimensional space of RoPE by modeling the relative position information of tokens through rotation on the Bloch Sphere.

Long-context LLMs based on RoPE: To enhance the long-context capabilities of Large Language Models (LLMs) that use RoPE, several position interpolation techniques have been developed. These include Linear Position Interpolation (LPI) [4], Neural Tangent Kernel (NTK)-aware interpolation [17], and YaRN (Yet another RoPE extensioN) [18]. PoSE (Positional Skip-wisE training) [31] has increased sequence lengths to $128k$ by combining these positional interpolation strategies. Additionally, LongLoRA [5] introduced the shift short attention mechanism, which approximates full attention and extends sequences up to $100k$, building on the LLaMA-2-7B model and the LoRA fine-tuning approach [12]. 3D-RPE further strengthens the positional relationships between distant tokens by capturing inter-chunk positional information, and it is compatible with existing fine-tuning techniques such as LoRA for improving long-context representation. The Dual Chunk Attention (DCA) [2] method, which enhances the use of pre-trained integer-based parameters, splits query and key sequences into chunks and uses three specialized matrices to capture the relative positions within and between these chunks. This method improves the model's ability to process longer sequences, but it cannot model the relative positions between tokens in distant chunks. In our work, we employ rotary position encoding to link attention across different chunks.

5 Experiments

We evaluate the proposed position encoding, 3D-RPE, on LLaMA2 [24] models (specifically, LLaMA-2-7B and LLaMA-2-7B-chat), which have a $4k$ pre-training context, and on LLaMA-3-8B-Instruct (https://github.com/meta-llama/llama3), which has an $8k$ pre-training context. Our experiments explore the following aspects: 1) the effect of 3D-RPE on long-context generation, assessed using perplexity; 2) the impact of 3D-RPE on long-context understanding and generation tasks, reflected by accuracy on long-sequence natural language tasks, e.g., multi-document QA; 3) ablation studies to confirm the advantages of 3D-RPE in position interpolation.

5.1 Experimental Settings

In this section, we elaborate on the experimental setup by introducing two types of tasks (i.e., long-context language understanding and long sequence language modeling) and detailing three aspects of the configuration (i.e., training parameters, training and evaluation datasets, and baseline models).

Training Setting: For long-context Natural Language Understanding (NLU) tasks, we fine-tuned LLaMA-2-7B-chat and LLaMA-3-8B-Instruct, extending their context lengths from $4k$ to $16k$ and from $8k$ to $16k$, respectively. Fine-tuning follows the strategy of LongChat [13] and runs for $3,000$ training steps. For the long-sequence Language Modeling (LM) tasks, we fine-tuned LLaMA-2-7B to support an extended context length of $32k$ tokens, using $1,000$ training steps. We set the per-device batch size to $1$ and the number of gradient accumulation steps to $8$, which means that the batch size is $8$. We train the model on the next-token prediction objective with LoRA [12].

We employed the AdamW optimizer [15] with $\beta_{1}=0.9$ and $\beta_{2}=0.95$ for all fine-tuned models. The chunk size is set to $3k$. The learning rate was set to $2\times 10^{-5}$, and a linear learning rate warmup was applied. Training was conducted on a single machine with 4 A800 GPUs using FlashAttention-2 [10].

Datasets: For long-context NLU tasks, we employ the LongAlpaca-12k dataset, which contains 9,000 LongQA and 3,000 short QA entries [6], and the LongAlpaca-16k-length dataset (https://github.com/dvlab-research/LongLoRA/). To evaluate the performance of 3D-RPE for long-context extension, we use LongBench [3], which includes $13$ English tasks, $5$ Chinese tasks, and $2$ code tasks, with most tasks having an average context length of $5k$ to $15k$ tokens. We focus on the English and code tasks to evaluate our method, 3D-RPE. Additionally, the LEval [1] evaluation set, which also consists of long-context datasets, is used to verify the effectiveness of 3D-RPE. We use the five datasets annotated from scratch in LEval, namely Coursera, QuALiTY, CodeU, GSM, and TOEFL.

For long-sequence LM tasks, we use RedPajama-Data [9] for fine-tuning. It is a large-scale pre-training dataset (1.2 trillion tokens) designed to provide high-quality training data for language models, and it contains multiple data sources (e.g., GitHub, arXiv, books, C4, and Wikipedia). We sample $20,000$ samples from these data sources for training. For evaluation, we utilize the PG19 book corpus dataset [20], which includes 100 documents, and the Arxiv Math Proof-pile dataset (test split). Additionally, all methods evaluate perplexity using a sliding window, following [19].
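The sliding-window evaluation mentioned above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the paper: the window and stride values are not stated in the text, and `log_prob_fn` is a hypothetical stand-in for a language model that returns the natural-log probability of a target token given its in-window context.

```python
import math
from typing import Callable, Sequence


def sliding_window_perplexity(
    tokens: Sequence[int],
    window: int,
    stride: int,
    log_prob_fn: Callable[[Sequence[int], int], float],
) -> float:
    """Perplexity of `tokens` when the model sees at most `window` tokens.

    `log_prob_fn(context, target)` is a hypothetical model interface that
    returns ln p(target | context).
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        # Score only tokens that no earlier window has scored; the first
        # token of a window has no in-window context, so it is skipped.
        for i in range(max(prev_end, begin + 1), end):
            total_nll -= log_prob_fn(tokens[begin:i], tokens[i])
            n_scored += 1
        prev_end = end
        if end == len(tokens):
            break
    if n_scored == 0:
        raise ValueError("need at least two tokens to score")
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(total_nll / n_scored)
```

A smaller stride gives each scored token more context at the cost of more forward passes; with `stride == window` the windows do not overlap.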

Baseline models: For long-context NLU tasks, the fine-tuned models LongAlpaca-16k [5], LongChat-32k [14], LongLLaMA [25], and ChatGLM [11] are used as baselines. The fine-tuning-free methods from the language modeling tasks are also used as baselines in long-context NLU tasks.

In long-sequence LM tasks, LongLoRA [5], StreamingLLM [29], Positional Interpolation (PI) [4], and NTK-Aware Scaled RoPE (NTK) [17] are selected as baselines, all based on the LLaMA-2-7B-base model. Among these baselines, PI, NTK, and StreamingLLM are fine-tuning-free methods, while the fine-tuned models include LongLoRA and Activation Beacon [30]. In the ablation experiments, training-free interpolation methods such as PI and NTK are used as baselines.

Table 1: Comparison between open-source based models on long-context NLU tasks. Our model, 3D-RPE-LLaMA2-7B-Chat, is fine-tuned from LLaMA-2-7B-chat and extended from $4k$ to $16k$ context length. Baseline models fall into two groups: those that require fine-tuning (such as LongAlpaca [5] and LongLLaMA [25]) and those that do not (including PI, NTK, StreamingLLM [29], and ChunkLLaMA-$16k$ [2]). The experimental results for each specific evaluation set are given in Supplementary Material D.2.

| Methods | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Code |
| --- | --- | --- | --- | --- | --- |
| LLaMA-2-7B-chat | 24.90 | 22.60 | 24.70 | 60.01 | 48.10 |
| LLaMA-2-7B-chat-PI | 18.98 | 17.16 | 25.03 | 49.43 | 52.73 |
| LLaMA-2-7B-chat-NTK | 23.21 | 23.34 | 24.40 | 59.29 | 49.28 |
| StreamingLLM | 21.47 | 22.22 | 22.20 | 50.05 | 48.00 |
| ChunkLLaMA-16k | 24.04 | 22.98 | 21.52 | 46.31 | 49.73 |
| LongChat-32k | 31.58 | 23.50 | 26.70 | 64.02 | 54.10 |
| LongAlpaca-16k | 28.70 | 28.10 | 27.80 | 63.70 | 56.00 |
| LongLLaMA | 30.12 | 16.37 | 24.19 | 60.31 | 66.05 |
| Vicuna-v1.5-7B-16k | 28.01 | 18.63 | 26.01 | 66.20 | 47.30 |
| ChatGLM3-6B-32k | 40.30 | 46.60 | 29.50 | 68.10 | 56.20 |
| 3D-RPE-LLaMA2-7B-Chat | 47.40 | 60.10 | 28.99 | 73.16 | 76.50 |

Table 2: Comparison with the open-source models LLaMA-2-7B-chat and LLaMA3-8B-Instruct on 5 closed-ended tasks with various input lengths from LEval [1]. The evaluation metric is "EM", the exact match score. * indicates that the model is train-free.

| Models | Coursera | QuALiTY | CodeU | GSM | TOEFL |
| --- | --- | --- | --- | --- | --- |
| LLaMA-2-7B-Chat | 29.21 | 37.62 | 1.11 | 19.00 | 51.67 |
| LongChat-7B-16K | 29.74 | 33.66 | 3.33 | 10.00 | 47.95 |
| LLaMA2-7B-NTK | 32.71 | 33.16 | 0.00 | 19.00 | 52.78 |
| Vicuna1.5-7B-16k | 38.66 | 39.60 | 5.55 | 19.00 | 55.39 |
| 3D-RPE-LLaMA2-7B-Chat (ours) | 39.38 | 38.11 | 2.22 | 21.01 | 57.99 |
| LLaMA3-8B-Instruct* | 51.45 | 64.34 | 4.44 | 76.00 | 82.89 |
| 3D-RPE-LLaMA3-8B-Instruct* | 51.89 | 61.38 | 4.44 | 80.00 | 82.89 |

Table 3: Perplexity evaluation of different extending methods. We conduct the evaluation on the Proof-pile and PG-19 test datasets, varying the evaluation context window size from $8k$ to $100k$. OOM denotes out of memory; "-" marks cells with no reported value.

| Methods | PG-19 8k | PG-19 16k | PG-19 32k | PG-19 100k | Proof-Pile 8k | Proof-Pile 16k | Proof-Pile 32k | Proof-Pile 100k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B-Base | 131.09 | $>10^{2}$ | $>10^{2}$ | OOM | 16.79 | $>10^{2}$ | $>10^{2}$ | OOM |
| LLaMA2-7B-PI | 11.32 | 19.5 | $>10^{2}$ | OOM | 3.86 | 5.94 | 33.7 | OOM |
| LLaMA2-7B-NTK | 10.28 | 11.5 | 37.8 | OOM | 3.98 | 5.94 | 33.7 | OOM |
| StreamingLLM | 9.23 | 9.25 | 9.24 | 9.32 | 3.47 | 3.51 | 3.50 | 3.55 |
| LongLoRA-32k | 7.33 | 7.16 | 7.04 | - | 2.78 | 2.61 | 2.50 | - |
| LongLoRA-100k | 7.57 | 7.33 | 7.16 | 7.04 | 2.78 | 2.60 | 2.58 | 2.52 |
| LongChat-32k | 8.92 | 8.85 | 8.81 | OOM | 2.98 | 2.70 | 2.65 | OOM |
| Activation Beacon | 8.52 | 8.54 | 8.56 | 8.68 | 3.45 | 3.42 | 3.39 | 3.35 |
| 3D-RPE-LLaMA2-7B | 7.03 | 7.10 | 8.09 | 8.12 | 2.72 | 2.93 | 2.89 | 3.05 |

Table 4: Perplexity on the PG19 validation split. '*' denotes train-free, i.e., 3D-RPE is applied directly to the LLaMA2-7B-Base model without additional fine-tuning, using a chunk size of $3k$. The $8k$ context length is reached directly with 3D-RPE. The $16k$ and $32k$ context lengths are reached through linear positional interpolation with chunks, starting from the $8k$ context length.

| Models | 4k | 8k | 16k | 32k |
| --- | --- | --- | --- | --- |
| LLaMA2-7B-PI | 7.94 | 9.19 | 15.11 | $>10^{2}$ |
| LLaMA2-7B-NTK | 7.87 | 11.98 | 26.12 | 58.91 |
| LLaMA2-7B-Yarn | 7.87 | 8.06 | 9.82 | 11.74 |
| 3D-RPE-LLaMA2-7B* | 7.87 | 7.90 | 7.71 | 9.34 |

5.2 Long-Context Natural Language Understanding

In this task, the LongBench [3] evaluation set was used first. Five categories of tasks were included: single-document QA (3 tasks), multi-document QA (3 tasks), summarization (3 tasks), few-shot learning (3 tasks), and code completion (2 tasks). The average score for each category is reported in Table 1. The evaluation metrics follow those specified in LongBench [3], which differ across tasks and are detailed in Supplementary Material D.1. The results in Table 1 show that our model has a significant performance advantage over the baseline models in four categories (single-document QA, multi-document QA, few-shot learning, and code completion), both against training-free methods and against fine-tuned models. In summarization, our model achieved performance comparable to ChatGLM3-6B-$32k$. These results suggest that 3D-RPE strengthens the correlation between tokens at distant relative positions in long contexts, resulting in improved performance.

Subsequently, the LEval benchmark [1] was employed. Table 2 shows that our model, 3D-RPE-LLaMA2-7B-Chat, outperformed LLaMA2-7B-NTK on all five tasks and LongChat-7B-$16K$ on all tasks except CodeU. Although it did not surpass Vicuna1.5-7B-$16K$ on the QuALiTY and CodeU tasks, it achieved the best scores among the LLaMA-2-based models on the Coursera, GSM, and TOEFL tasks. Additionally, we conducted experiments on LLaMA3-8B-Instruct using a 16k context window with 3D-RPE. 3D-RPE-LLaMA3-8B-Instruct* improved on the Coursera and GSM tasks. On CodeU and TOEFL the scores were unchanged, while QuALiTY dropped from 64.34 to 61.38. Overall, these experimental results support the effectiveness of the 3D-RPE method.

5.3 Long-Sequence Language Modeling

In Table 3, we present the perplexity scores of our model, 3D-RPE-LLaMA-2-7B, and the baseline models on the Proof-pile and PG19 test datasets. 3D-RPE-LLaMA-2-7B was fine-tuned from the LLaMA2-7B-Base model using a dataset with a $32k$ context window. To evaluate performance, we set sequence lengths of $8k$, $16k$, and $32k$. We further extended our model's sequence length from $32k$ to $100k$ using the position extending method from PoSE [31]. The results indicate that our method outperforms train-free sequence extending models. Compared to fine-tuned models, our model achieves lower perplexity on PG19 at $8k$ and $16k$ sequence lengths. This suggests that fine-tuning with 3D-RPE for a larger context window ($32k$) maintains or improves modeling performance at smaller context lengths ($8k$ and $16k$). At $32k$ and $100k$, our model did not surpass LongLoRA-$32k$ and LongLoRA-$100k$, but on PG19 it did outperform LongChat-$32k$ and Activation Beacon.

Notably, in combination with other train-free extension methods, our model can be further extended from $32k$ to $100k$ without a significant increase in perplexity. In contrast, due to their specific attention mechanism, the LongLoRA models cannot be extended beyond their predefined context windows in a train-free manner. For instance, LongLoRA-$32k$ cannot be further extended to $100k$.

5.4 Ablation Study

In this section, we conduct ablation studies to explore how 3D-RPE affects the linear interpolation method. We compare position interpolation methods (PI, NTK, and Yarn) with the method that combines 3D-RPE with position interpolation on the LLaMA-2-7B-Base model in a train-free manner. The experimental results can be found in Table 4. With linear positional interpolation from $8k$ to $16k$ and $32k$, the 3D-RPE-LLaMA2-7B* model achieves lower perplexity than the interpolation baselines, which we attribute to 3D-RPE mitigating the decrease in positional resolution caused by interpolation. These results are consistent with Theorem 1 in Section 3.2.2.
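To make the resolution argument concrete, the following sketch shows plain linear position interpolation [4] applied to RoPE angles. It illustrates only the baseline mechanism, not the chunk-based 3D-RPE variant evaluated above. The frequency formula $\theta_{l}=base^{-2l/d}$ is taken from Appendix B.1; the values `base=10000`, `dim=128`, and the two lengths are assumptions chosen for illustration.

```python
import numpy as np


def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """theta_l = base^(-2l/d) for l = 0 .. d/2 - 1."""
    return base ** (-2.0 * np.arange(dim // 2) / dim)


def interpolated_positions(seq_len: int, train_len: int) -> np.ndarray:
    """Linear interpolation: squeeze position indices into [0, train_len)."""
    scale = min(1.0, train_len / seq_len)
    return np.arange(seq_len) * scale


def rotation_angles(positions: np.ndarray, dim: int) -> np.ndarray:
    """Angle m * theta_l for every position m and frequency index l."""
    return np.outer(positions, rope_frequencies(dim))


positions = interpolated_positions(seq_len=16384, train_len=4096)
angles = rotation_angles(positions, dim=128)
# Adjacent tokens are now 0.25 position units apart instead of 1, so the
# angular gap between neighbours shrinks by the same factor of 4.
step = positions[1] - positions[0]
```

The shrinking gap between neighbouring positions is the loss of position resolution that the ablation attributes to interpolation.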

6 Conclusion and Future Work

In this paper, we present a novel rotary position encoding method called 3D Rotary Position Encoding (3D-RPE). Compared to RoPE, we have theoretically proved that 3D-RPE possesses two key advantages: controllable long-term decay and enhanced interpolation resolution. Experimentally, 3D-RPE has demonstrated outstanding performance in long-context Natural Language Understanding.

In the future, 3D-RPE holds promise as a foundational positional encoding strategy for LLMs, especially in the aspect of modeling long contexts. Moreover, given that 3D-RPE encapsulates positional encoding within a three-dimensional framework, it has the potential to integrate with visual data, thereby facilitating an in-depth exploration of its efficacy in synchronizing graphical and textual semantic information.

References

  • [1] Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.
  • [2] Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models, 2024.
  • [3] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
  • [4] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
  • [5] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307, 2023.
  • [6] Yukang Chen, Shaozuo Yu, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Long alpaca: Long-context instruction-following models. https://github.com/dvlab-research/LongLoRA, 2023.
  • [7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna, 3(5), 2023.
  • [8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [9] Together Computer. Redpajama: An open source recipe to reproduce llama training dataset. https://github.com/togethercomputer/RedPajama-Data, 2023.
  • [10] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. CoRR, 2023.
  • [11] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
  • [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [13] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
  • [14] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length? arXiv preprint arXiv:2306.04537, June 2023.
  • [15] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • [16] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  • [17] Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have, 2023.
  • [18] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
  • [19] Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.
  • [20] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020.
  • [21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • [22] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 464–468, 2018.
  • [23] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  • [24] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [25] Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170, 2023.
  • [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [27] Ben Wang and Aran Komatsuzaki. Gpt-j-6b: A 6 billion parameter autoregressive language model. GitHub, 2021.
  • [28] Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. In International Conference on Learning Representations, 2020.
  • [29] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv, 2023.
  • [30] Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending llm’s context with activation beacon, 2024.
  • [31] Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. Pose: Efficient context window extension of llms via positional skip-wise training. arXiv preprint arXiv:2309.10400, 2023.

Appendix A Bloch Sphere

Refer to caption
Figure 4: A diagram of Bloch Sphere.

Bloch Sphere: 3D Rotary Position Encoding (3D-RPE), proposed by us, corresponds to a Bloch Sphere. In this section, we mainly introduce the basic concept of Bloch Sphere, which corresponds to Eq. (1) in this paper.

The Bloch Sphere is a geometric tool used to represent qubits, typically depicted in a three-dimensional polar coordinate system as a point on the sphere (see Figure 4). A single-qubit state is represented by the following equation in linear algebra:

$$\ket{\phi}=\alpha\ket{0}+\beta\ket{1} \tag{9}$$

where $\alpha$ and $\beta$ are complex numbers, i.e., $\alpha,\beta\in\mathbb{C}$. According to Euler's formula in complex analysis, the coefficients $\alpha$ and $\beta$ can be re-expressed as:

$$\begin{aligned}
\alpha &= a+b\mathrm{i}=r_{0}(\cos(\theta_{\alpha})+\mathrm{i}\sin(\theta_{\alpha}))=r_{0}e^{\mathrm{i}\theta_{\alpha}}\\
\beta &= c+d\mathrm{i}=r_{1}(\cos(\theta_{\beta})+\mathrm{i}\sin(\theta_{\beta}))=r_{1}e^{\mathrm{i}\theta_{\beta}}
\end{aligned} \tag{10}$$

By substituting Eq. (10) into Eq. (9), the quantum state representation is denoted as:

$$\begin{aligned}
\ket{\phi} &= r_{0}(\cos(\theta_{\alpha})+\mathrm{i}\sin(\theta_{\alpha}))\ket{0}+r_{1}(\cos(\theta_{\beta})+\mathrm{i}\sin(\theta_{\beta}))\ket{1}\\
&= r_{0}e^{\mathrm{i}\theta_{\alpha}}\ket{0}+r_{1}e^{\mathrm{i}\theta_{\beta}}\ket{1}\\
&= e^{\mathrm{i}\theta_{\alpha}}\left(r_{0}\ket{0}+r_{1}e^{\mathrm{i}(\theta_{\beta}-\theta_{\alpha})}\ket{1}\right)
\end{aligned} \tag{11}$$

Here, $\theta_{\alpha}$ is the global phase.

Considering the normalization condition $|\alpha|^{2}+|\beta|^{2}=1$, we have:

$$|r_{0}|^{2}+|r_{1}e^{\mathrm{i}(\theta_{\beta}-\theta_{\alpha})}|^{2}=r_{0}^{2}+r_{1}^{2}|e^{\mathrm{i}(\theta_{\beta}-\theta_{\alpha})}|^{2}=r_{0}^{2}+r_{1}^{2}=1 \tag{12}$$

Setting $r_{0}=\cos\frac{\varphi}{2}$, $r_{1}=\sin\frac{\varphi}{2}$, $\theta_{\alpha}=\theta$, and $\theta_{\beta}-\theta_{\alpha}=\theta_{1}$, the state $\ket{\phi}$ can be expressed as:

$$\ket{\phi}=e^{\mathrm{i}\theta}\left(\cos\frac{\varphi}{2}\ket{0}+\sin\frac{\varphi}{2}e^{\mathrm{i}\theta_{1}}\ket{1}\right) \tag{13}$$

This yields Eq. (1) of this paper.

To adapt to the original 2D rotary position encoding (RoPE) of pre-trained LLMs, such as the LLaMA models, the global phase $\theta$ is used to model the relative positions between tokens within a chunk, while the rotation angle $\frac{\varphi}{2}$ is used to model the relative positions between tokens across chunks.
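The parametrization in Eq. (13) can be checked numerically. The sketch below is an illustration only; the function names and the sample angles are hypothetical. It builds the two amplitudes from $(\theta,\varphi,\theta_{1})$, confirms the normalization condition of Eq. (12), and shows that changing the global phase $\theta$ leaves the squared magnitudes unchanged.

```python
import cmath
import math


def bloch_state(theta: float, phi: float, theta1: float) -> tuple[complex, complex]:
    """Amplitudes (alpha, beta) of the state in Eq. (13)."""
    global_phase = cmath.exp(1j * theta)
    alpha = global_phase * math.cos(phi / 2)
    beta = global_phase * math.sin(phi / 2) * cmath.exp(1j * theta1)
    return alpha, beta


def probabilities(state: tuple[complex, complex]) -> tuple[float, float]:
    """Squared magnitudes |alpha|^2 and |beta|^2."""
    alpha, beta = state
    return abs(alpha) ** 2, abs(beta) ** 2


# Same phi and theta1, two different global phases theta.
p0, p1 = probabilities(bloch_state(theta=0.3, phi=1.1, theta1=2.0))
q0, q1 = probabilities(bloch_state(theta=2.5, phi=1.1, theta1=2.0))
```

Both pairs sum to one and agree with each other, since $\theta$ only multiplies the whole state by a unit-modulus factor.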

Appendix B Supplementary Material for the Method Section

In this section, we mainly introduce the specific implementation of our positional encoding method (3D-RPE), and the formula derivation details of attention score calculation (Eq. (5) of this paper) not detailed in this paper.

B.1 Implementation of 3D-RPE

In Section 3.1, we give the general form of 3D Rotary Position Encoding (3D-RPE):

$$\widetilde{\bm{h}}_{j,m}=e^{\mathrm{i}m\theta}\left(\cos{\varphi_{j}}\,\bm{h}_{j,m}^{\perp}+\sin{\varphi_{j}}\,\bm{h}_{j,m}\right)$$

$\cos{\varphi_{j}}$ and $\sin{\varphi_{j}}$ are scalar quantities in $\mathbb{R}$. $\bm{h}_{j,m}^{\perp}$ and $\bm{h}_{j,m}$ are shown below:

$$\bm{h}_{j,m}=\begin{bmatrix}h^{0}\\ h^{1}\\ \vdots\\ h^{d/2-2}\\ h^{d/2-1}\\ h^{d/2}\\ h^{d/2+1}\\ \vdots\\ h^{d-2}\\ h^{d-1}\end{bmatrix}\qquad\qquad\bm{h}_{j,m}^{\perp}=\begin{bmatrix}-h^{d/2}\\ -h^{d/2+1}\\ \vdots\\ -h^{d-2}\\ -h^{d-1}\\ h^{0}\\ h^{1}\\ \vdots\\ h^{d/2-2}\\ h^{d/2-1}\end{bmatrix} \tag{14}$$

In the concrete implementation, analogous to RoPE, for each two-dimensional subspace $\mathbb{R}^{2}$ of $\mathbb{R}^{d}$, we assign angles $\theta_{l}=base^{-2l/d}$ that vary from high to low frequencies. An equivalent rotation matrix $\mathcal{R}_{m}^{\theta}$ is used in place of $e^{\mathrm{i}m\theta}$:

$$
\mathcal{R}^{\theta}_{m}=\left[\begin{array}{cccccccc}
\cos m\theta_{0}&0&\cdots&0&-\sin m\theta_{0}&0&\cdots&0\\
0&\cos m\theta_{1}&\cdots&0&0&-\sin m\theta_{1}&\cdots&0\\
\vdots&\vdots&\ddots&\vdots&\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\cos m\theta_{d/2-1}&0&0&\cdots&-\sin m\theta_{d/2-1}\\
\sin m\theta_{0}&0&\cdots&0&\cos m\theta_{0}&0&\cdots&0\\
0&\sin m\theta_{1}&\cdots&0&0&\cos m\theta_{1}&\cdots&0\\
\vdots&\vdots&\ddots&\vdots&\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\sin m\theta_{d/2-1}&0&0&\cdots&\cos m\theta_{d/2-1}
\end{array}\right]
\tag{15}
$$

Therefore, Eq. (4) of this paper can be rewritten as

$$
\widetilde{\bm{h}}_{j,m}=\mathcal{R}^{\theta}_{m}\left(\cos\varphi_{j}\,\bm{h}_{j,m}^{\perp}+\sin\varphi_{j}\,\bm{h}_{j,m}\right)
$$

where $\mathcal{R}^{\theta}_{m}$ is equivalent to the rotation matrix in RoPE, rearranged to match the implementation and the derivations used in LLMs. In the implementation, after the rotary position encoding of the LLM is applied, the long sequence is split into chunks of size $c$. The rotation $\varphi_{j}$ is then applied to each chunk, where $j$ is the position of the chunk.
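The transformation above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `rotate_half`, `rope_rotate`, and `rpe3d` are hypothetical, the default `base=10000.0` is an assumption, and the chunk angle `phi_j` is passed in by the caller because its definition lies outside this appendix.

```python
import numpy as np

def rotate_half(h):
    """Build h_perp as in Eq. (14): (-second half, first half)."""
    d = h.shape[-1]
    first, second = h[..., : d // 2], h[..., d // 2 :]
    return np.concatenate([-second, first], axis=-1)

def rope_rotate(h, m, base=10000.0):
    """Apply R_m^theta of Eq. (15) without materialising the d x d matrix."""
    d = h.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # theta_l = base^(-2l/d)
    angles = m * np.concatenate([theta, theta])     # coordinates l and l + d/2 share theta_l
    return np.cos(angles) * h + np.sin(angles) * rotate_half(h)

def rpe3d(h, m, phi_j, base=10000.0):
    """h_tilde = R_m^theta (cos(phi_j) * h_perp + sin(phi_j) * h).

    phi_j is the chunk-level angle; how it is derived from the chunk
    index j is not specified here, so it is an input (assumption).
    """
    mixed = np.cos(phi_j) * rotate_half(h) + np.sin(phi_j) * h
    return rope_rotate(mixed, m, base)

h = np.arange(1.0, 9.0)            # toy hidden state with d = 8
h_tilde = rpe3d(h, m=5, phi_j=0.3)
```

Both steps are rotations within each coordinate pair $(l, d/2+l)$, so the norm of `h` is preserved. With `phi_j = np.pi / 2` the mixing step leaves `h` unchanged and the function reduces to the RoPE rotation.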

B.2 Derivation of Attention for 3D-RPE

The derivation of the attention score calculation (Eq. (5)) is as follows.

Since $\bm{h}^{\perp}=e^{\mathrm{i}\frac{\pi}{2}}\bm{h}=\mathrm{i}\bm{h}$, we obtain:

$$
\begin{aligned}
\widetilde{\bm{h}}_{j,m}&=e^{\mathrm{i}m\theta}\left(\mathrm{i}\cos\varphi_{j}\,\bm{h}_{j,m}+\sin\varphi_{j}\,\bm{h}_{j,m}\right)\\
&=e^{\mathrm{i}m\theta}\left(\mathrm{i}\sin\left(\tfrac{\pi}{2}-\varphi_{j}\right)\bm{h}_{j,m}+\cos\left(\tfrac{\pi}{2}-\varphi_{j}\right)\bm{h}_{j,m}\right)\\
&=e^{\mathrm{i}m\theta}e^{\mathrm{i}\left(\frac{\pi}{2}-\varphi_{j}\right)}\bm{h}_{j,m}
\end{aligned}
\tag{16}
$$

Let $\bm{q}_{i,m}=\text{3d-PE}(\bm{q},i,m)$ and $\bm{k}_{j,n}=\text{3d-PE}(\bm{k},j,n)$. Taking the real part of the inner product of $\bm{q}_{i,m}$ and $\bm{k}_{j,n}$ yields:

$$
s(\bm{q}_{i,m},\bm{k}_{j,n})=\mathrm{Re}\left[e^{\mathrm{i}(\varphi_{i}-\varphi_{j})}\sum_{l=0}^{d/2-1}e^{\mathrm{i}(m-n)\theta_{l}}\left(\bm{q}_{l}\bm{k}_{l}+\bm{q}_{d/2+l}\bm{k}_{d/2+l}\right)\right]
\tag{17}
$$

which is a function of both $m-n$ and $\varphi_{i}-\varphi_{j}$.
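The dependence on the two differences can be checked numerically. The sketch below uses the complex form of Eq. (16), treating each coordinate pair $(h^{l}, h^{d/2+l})$ as one complex number so that $\bm{h}^{\perp}$ corresponds to multiplication by $\mathrm{i}$. The names `encode_complex` and `score`, the default `base`, and the sample angles are illustrative assumptions, not values from the paper.

```python
import numpy as np

def encode_complex(h, m, phi, base=10000.0):
    """Complex form of Eq. (16): exp(i*(m*theta_l + pi/2 - phi)) * z_l."""
    d = h.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # theta_l = base^(-2l/d)
    z = h[: d // 2] + 1j * h[d // 2 :]              # pair (h^l, h^{d/2+l}) as one complex number
    return np.exp(1j * (m * theta + np.pi / 2 - phi)) * z

def score(q, k, m, n, phi_i, phi_j):
    """Real part of the complex inner product of the encoded query and key."""
    qz = encode_complex(q, m, phi_i)
    kz = encode_complex(k, n, phi_j)
    return float(np.real(np.sum(qz * np.conj(kz))))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Shifting both token positions by 100 and both chunk angles by 1.0
# leaves m - n and phi_i - phi_j unchanged, so the score should match.
a = score(q, k, m=7, n=3, phi_i=0.4, phi_j=0.1)
b = score(q, k, m=107, n=103, phi_i=1.4, phi_j=1.1)
print(np.isclose(a, b))
```

When the query and key share the same position and chunk angle, the rotation cancels and the score equals the plain dot product of `q` and `k`.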

Appendix C 3D Rotary Position Encoding Resolution Enhancement

In this section, before proving Theorem 1, we first provide the definitions of positional resolution for RoPE, as well as the positional resolution after positional interpolation.

Definition 2 (Positional Interpolation Resolution).

Let $\bm{q}_{m+1}$ be the query state of the $(m+1)$-th hidden state and $\bm{k}_{m}$ be the key state of the $m$-th hidden state after RoPE. Given a pre-training length $L_{p}$, the attention score $a$ is:

$$
a(\bm{q}_{m+1},\bm{k}_{m})=\bm{q}\bm{k}^{T}e^{\mathrm{i}\theta}
\tag{18}
$$

The resolution $\mathcal{E}_{rope}$ corresponding to the initial length $L_{p}$ is $\mathcal{E}_{rope}=1$. After employing linear interpolation with length $L\geq L_{p}$, the attention score is:

$$
a(\bm{q}_{m+1},\bm{k}_{m})=\bm{q}\bm{k}^{T}e^{\mathrm{i}\frac{L_{p}}{L}\theta}
\tag{19}
$$

Note that the resolution $\mathcal{E}_{rope}$ becomes $\mathcal{E}^{\prime}_{rope}=L_{p}/L\leq 1$, which decreases as $L$ increases.
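As a small worked example of this definition, the sketch below computes the resolution after linear interpolation and the resulting rotation angle between adjacent positions. The function names and the lengths 4096 and 16384 are illustrative assumptions.

```python
def rope_resolution(pretrain_len, target_len):
    """Resolution after linear position interpolation: L_p / L (Eq. (19))."""
    if target_len < pretrain_len:
        raise ValueError("target_len must be at least pretrain_len")
    return pretrain_len / target_len

def adjacent_angle(theta, pretrain_len, target_len):
    """Rotation angle between positions m and m + 1 after interpolation."""
    return rope_resolution(pretrain_len, target_len) * theta

print(rope_resolution(4096, 16384))  # 0.25
```

Extending a 4096-token window to 16384 tokens therefore shrinks the rotation between neighbouring tokens to a quarter of its pre-training value.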

As the resolution decreases, the rotation of the attention score between adjacent positions becomes smaller, so the positional difference it reflects also becomes smaller. The following theorem explains how 3D-RPE mitigates this decrease in resolution.

Theorem 2 (Chunk Position Encoding Resolution Enhancement).

For a pre-trained language model with a pre-training length $L_{p}$ and a required extension length $L$, applying a linear position interpolation method $\mathcal{I}$ to Rotary Position Encoding (RoPE) changes the relative positional resolution from $\mathcal{E}_{rope}$ to $\mathcal{E}_{rope}^{\prime}$. Let $\mathcal{E}_{3d\text{-}rpe}^{\prime}$ denote the relative positional encoding resolution achieved by the method $\mathcal{I}$ based on 3D-RPE. With chunk size $c\geq 3$, we have:

$$
\mathcal{E}_{3d\text{-}rpe}^{\prime}>\mathcal{E}_{rope}^{\prime}
\tag{20}
$$
Proof.

For 3D-RPE, let the chunk size and the number of chunks be $c$ and $n=\lceil L_{p}/c\rceil$, respectively. Prior to interpolation, the indices within a chunk are $[0,1,\cdots,c-1]$. Linear interpolation distributes the excess $L-L_{p}$ tokens evenly across the $n$ chunks. The new indices within a chunk are then $[0,1,2,\cdots,c^{\prime}-1]$, where $c^{\prime}=\lceil L/n\rceil\leq L_{p}$. So the attention score of $\bm{q}_{i,m+1}$ and $\bm{k}_{i,m}$ based on 3D-RPE after interpolation is:

$$
\begin{aligned}
a_{3d\text{-}rpe}&=\bm{q}\bm{k}^{T}e^{\mathrm{i}\theta}e^{\mathrm{i}(\varphi_{i}-\varphi_{i})}\\
&=\bm{q}\bm{k}^{T}e^{\mathrm{i}\theta}
\end{aligned}
$$

The resolution of relative position for 3D-RPE is:

$$
\mathcal{E}_{3d\text{-}rpe}^{\prime}=1
$$

For the special case of $\bm{q}_{(i+1,0)}$ and $\bm{k}_{(i,c^{\prime}-1)}$:

$$
\mathcal{E}_{3d\text{-}rpe}^{\prime}\geq c^{\prime}-1+\frac{(\varphi_{i+1}-\varphi_{i})}{\theta}>c^{\prime}-2\geq 1
\tag{21}
$$

where $(\varphi_{i+1}-\varphi_{i})/\theta\geq-1/10000>-1$. As long as $c^{\prime}\geq 3$, we have $\mathcal{E}_{3d\text{-}rpe}^{\prime}\geq 1>\mathcal{E}_{rope}^{\prime}=L_{p}/L$. In practice, the chunk size $c$ is not set to a very small number, so $c^{\prime}\geq 3$ holds. Moreover, for different interpolation lengths $L$, the number of chunks $n$ must be configured such that $c^{\prime}=\lceil L/n\rceil\leq L_{p}$. ∎
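The chunk bookkeeping used in the proof can be sketched as follows. The function names and the example lengths are illustrative assumptions. The code only evaluates $n$, $c^{\prime}$, and the two conditions $3\leq c^{\prime}\leq L_{p}$ that the proof relies on.

```python
import math

def chunk_layout(pretrain_len, target_len, chunk_size):
    """Return (n, c_prime): number of chunks and per-chunk length after interpolation."""
    n = math.ceil(pretrain_len / chunk_size)  # n = ceil(L_p / c)
    c_prime = math.ceil(target_len / n)       # c' = ceil(L / n)
    return n, c_prime

def satisfies_proof_conditions(pretrain_len, target_len, chunk_size):
    """Check 3 <= c' <= L_p, the conditions used around Eq. (21)."""
    _, c_prime = chunk_layout(pretrain_len, target_len, chunk_size)
    return 3 <= c_prime <= pretrain_len

n, c_prime = chunk_layout(pretrain_len=4096, target_len=16384, chunk_size=1024)
print(n, c_prime)  # 4 4096
```

With these example values, a chunk size of 1024 gives $c^{\prime}=4096\leq L_{p}$, whereas a chunk size of 2048 gives $n=2$ and $c^{\prime}=8192>L_{p}$. This is why the number of chunks has to be adjusted to the target length $L$.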

Appendix D Experimental Supplementary Materials

D.1 Evaluation Metrics

This section lists the evaluation metric used for each of the 16 tasks from LongBench.

| Dataset | Metric |
| --- | --- |
| Narrative QA | F1_Score |
| Qasper | F1_Score |
| MultiFieldQA-En | F1_Score |
| Hotpot QA | F1_Score |
| 2WikiM QA | F1_Score |
| Musique | F1_Score |
| GovReport | Rouge_Score |
| QMSum | Rouge_Score |
| MultiNews | Rouge_Score |
| Trec | Classification_Score |
| Trivia QA | F1_Score |
| SAMSum | Rouge_Score |
| PassageRetrieval-En | Retrieval_Score |
| Passage Count | Count_Score |
| Lcc | Code_Sim_Score |
| RepoBench-P | Code_Sim_Score |

D.2 Details of Experimental Results

This section reports the performance on every task within each task category of LongBench. The results are given in Table 5.

Table 5: Comparison of Experimental Performance on Different Datasets for Various Tasks in LongBench, Using Baseline Models Provided by LongBench. 3D-RPE-LLaMA2-7B is our model.
| Single-Document QA | Narrative QA | Qasper | MultiFieldQA-En |
| --- | --- | --- | --- |
| LLaMA2-7B-Chat-4k | 18.7 | 19.2 | 36.8 |
| LongChat-v1.5-7B-32k | 16.9 | 27.7 | 41.4 |
| InternLM-7B-8k | 12.1 | 16.7 | 23.4 |
| Vicuna-v1.5-7B-16k | 19.4 | 26.1 | 38.5 |
| LongLora-16k | 19.8 | 29.1 | 37.1 |
| 3D-RPE-LLaMA2-7B (ours) | 40.56 | 41.35 | 60.3 |

| Multi-Document QA | Hotpot QA | 2WikiM QA | Musique |
| --- | --- | --- | --- |
| LLaMA2-7B-Chat-4k | 25.4 | 32.8 | 9.4 |
| LongChat-v1.5-7B-32k | 31.5 | 20.6 | 9.7 |
| InternLM-7B-8k | 28.7 | 22.8 | 9.0 |
| Vicuna-v1.5-7B-16k | 25.3 | 20.8 | 9.8 |
| LongLora-16k | 37.01 | 30.26 | 17.14 |
| 3D-RPE-LLaMA2-7B (ours) | 62.49 | 58.80 | 59.01 |

| Summarization | GovReport | QMSum | MultiNews |
| --- | --- | --- | --- |
| LLaMA2-7B-Chat-4k | 27.3 | 20.8 | 25.8 |
| LongChat-v1.5-7B-32k | 30.8 | 22.7 | 26.4 |
| InternLM-7B-8k | 9.7 | 15.9 | 22.8 |
| Vicuna-v1.5-7B-16k | 27.9 | 22.8 | 27.2 |
| LongLora-16k | 31.53 | 24.13 | 27.74 |
| 3D-RPE-LLaMA2-7B (ours) | 32.01 | 25.3 | 29.68 |

| Few-shot Learning | Trec | Trivia QA | SAMSum |
| --- | --- | --- | --- |
| LLaMA2-7B-Chat-4k | 61.5 | 77.8 | 40.7 |
| LongChat-v1.5-7B-32k | 63.5 | 82.3 | 34.2 |
| InternLM-7B-8k | 52.0 | 77.8 | 21.2 |
| Vicuna-v1.5-7B-16k | 71.5 | 86.2 | 40.8 |
| LongLora-16k | 63.5 | 85.69 | 41.88 |
| 3D-RPE-LLaMA2-7B-16k (ours) | 89.50 | 90.00 | 40.00 |

| Synthetic Tasks | Passage Count | PassageRetrieval-En |
| --- | --- | --- |
| LLaMA2-7B-Chat-4k | 2.1 | 9.8 |
| LongChat-v1.5-7B-32k | 1.0 | 30.5 |
| InternLM-7B-8k | 3.0 | 6.0 |
| Vicuna-v1.5-7B-16k | 6.5 | 4.5 |
| LongLora-16k | 3.61 | 29.75 |
| 3D-RPE-LLaMA2-7B-16k (ours) | 4.0 | 14.5 |

| Code Completion | Lcc | RepoBench-P |
| --- | --- | --- |
| LLaMA2-7B-Chat-4k | 52.4 | 43.8 |
| LongChat-v1.5-7B-32k | 53.0 | 55.3 |
| InternLM-7B-8k | 44.1 | 28.8 |
| Vicuna-v1.5-7B-16k | 51.0 | 43.5 |
| LongLora-16k | 57.61 | 54.45 |
| 3D-RPE-LLaMA2-7B-16k (ours) | 79.10 | 73.90 |