Since introducing the idea of mixed radix in “The Road to Transformer Upgrade: 11. Taking $\beta$-Radix Position to the End” to further generalize NTK-aware Scaled RoPE, the author has felt that this line of ideas has roughly reached its limit, and that any more substantial improvement would have to come from a different direction. At this point, the author recalled an idea conceived earlier that had been set aside because of its higher complexity. Having now hit a bottleneck, and with no better option at hand, the author picked it up again.
Unexpectedly, although this method increases some inference complexity, its experimental results are surprisingly good—even hinting at infinite length extrapolation capability! Therefore, the author is eager to write this article to share this method. Due to its formal similarity with the ReLU activation function, the author names this method “ReRoPE (Rectified Rotary Position Embeddings)”.
Revisit#
We know that RoPE is formally an absolute position encoding, but in practice, it brings relative position information to Attention, specifically the following Toeplitz matrix:
$$ \begin{equation}\begin{pmatrix}0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ 3 & 2 & 1 & 0 & \\ \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 2} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 1} & \small{L - 2} & \ddots & \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \end{pmatrix}\end{equation} $$Here $L$ is the current sample length. When $L$ significantly exceeds the training length, the additional positions have not been trained, so the performance cannot be guaranteed. This is why direct extrapolation (Length Extrapolation) usually performs poorly.
Later, researchers proposed Position Interpolation, which is equivalent to changing the relative position matrix to:
$$ \begin{equation}\begin{pmatrix}0 & \\ \frac{1}{k} & 0 & \\ \frac{2}{k} & \frac{1}{k} & 0 &\\ \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{\frac{L-2}{k}} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{\frac{L-1}{k}} & \small{\frac{L-2}{k}} & \ddots & \ddots & \ddots & \frac{3}{k} & \frac{2}{k} & \frac{1}{k} & 0 & \\ \end{pmatrix}\end{equation} $$In this way, by adjusting $k$, the maximum relative position can be guaranteed not to exceed the training length, thus avoiding extrapolation. However, it makes the position information more “crowded”, so a certain number of fine-tuning steps are required for the model to work again. Precisely because it avoids extrapolation, it requires far fewer fine-tuning steps than direct extrapolation (neural networks are often better at interpolation than extrapolation).
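As a rough NumPy sketch of the two matrices above (with arbitrary example values for $L$ and $k$, and only the causal lower-triangular part kept):

```python
import numpy as np

L, k = 10, 4  # example sequence length and interpolation factor

pos = np.arange(L)
rel = np.tril(pos[:, None] - pos[None, :])  # relative positions i - j, causal part only

rel_extrapolate = rel       # direct extrapolation: matrix (1), entries grow up to L - 1
rel_interpolate = rel / k   # Position Interpolation: matrix (2), entries compressed by 1/k

print(rel_extrapolate)
print(rel_interpolate)
```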
As for the later proposed NTK-aware Scaled RoPE, it takes a “sideways approach,” cleverly distributing the extrapolation pressure across each dimension, so it can achieve good results without fine-tuning. However, it still relies on extrapolation, which is something neural networks are not good at, so there is an upper limit to its effectiveness. In the author’s experiments, its Long Context performance still cannot get very close to the training performance.
Fusion#
We can also examine these methods from the perspective of language model locality. Locality means that when a language model predicts the next token, it clearly relies more on neighboring tokens. Direct extrapolation maintains locality (position encoding near 0 remains unchanged), but performs poorly because it introduces position encodings beyond the training length; Position Interpolation, while not extrapolating position encodings, disrupts locality (position encodings near 0 are compressed by $1/k$), so it does not perform well without fine-tuning; NTK-aware Scaled RoPE, through “high-frequency extrapolation, low-frequency interpolation,” implicitly combines the advantages of both, ensuring locality without obvious position encoding extrapolation, thus achieving good results even without fine-tuning.
Is there a method that can combine extrapolation and interpolation more directly? Yes, we can set a window size $w$. Within the window, we use a position interval of size $1$, and outside the window, we use a position interval of size $1/k$. The entire relative position matrix is as follows:
$$ \begin{equation}\begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 & \\ \ddots & 2 & 1 & 0 & \\ \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w + \frac{1}{k} & w & \ddots & \ddots & 2 & 1 & 0 & \\ w + \frac{2}{k} & w + \frac{1}{k} & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & w + \frac{2}{k} & \ddots & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \\ \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w + \frac{L-1-w}{k} & \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ \end{pmatrix}\end{equation} $$As long as $w$ is less than the training length, by controlling $k$, we can ensure that all position encodings do not exceed the training length, while precisely maintaining locality. This simply and directly combines direct extrapolation and position interpolation.
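A minimal NumPy sketch of this blended position matrix, assuming example values for $L$, $w$ and $k$:

```python
import numpy as np

L, w, k = 12, 4, 8  # example length, window size, and outer interval 1/k

pos = np.arange(L)
rel = pos[:, None] - pos[None, :]                    # standard relative positions i - j
# inside the window (i - j < w): keep the step-1 positions;
# outside the window: continue from w with a step of 1/k, as in matrix (3)
leaky_rerope = np.tril(np.where(rel < w, rel, w + (rel - w) / k))
print(leaky_rerope)
```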
In particular, the matrix above has a special case: when $k\to\infty$, it simplifies to
$$ \begin{equation}\begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 & \\ \ddots & 2 & 1 & 0 & \\ \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w & w & \ddots & \ddots & 2 & 1 & 0 & \\ w & w & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & w & \ddots & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \\ \ddots & \ddots & \ddots & w & w & w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ w & \ddots & \ddots & \ddots & w & w & w & \small{w - 1} & \ddots & 2 & 1 & 0 & \\ \end{pmatrix}\end{equation} $$In this case, regardless of the input length, the range of its position encoding does not exceed $w$, so this is a scheme that potentially supports arbitrary length Context!
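In code, this limit amounts to nothing more than clipping the relative positions at $w$; a minimal self-contained sketch:

```python
import numpy as np

L, w = 12, 4
pos = np.arange(L)
rel = pos[:, None] - pos[None, :]
# the k -> infinity limit simply clips every relative position at w, as in matrix (4)
rerope = np.tril(np.minimum(rel, w))
print(rerope)
```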
Formally, the relationship of the $k\to\infty$ matrix and the finite-$k$ matrix above to the standard RoPE matrix mirrors the relationship of ReLU and Leaky ReLU to a linear function. The author therefore calls the $k\to\infty$ version “ReRoPE (Rectified RoPE)” and the finite-$k$ version “Leaky ReRoPE”.
Computation#
In fact, similar ideas are not hard to come up with. Earlier relative position encodings implemented as an Attention Bias (such as the classic relative position encoding or T5's position encoding) often use this kind of piecewise treatment. However, unlike those encodings, implementing such a piecewise scheme in RoPE significantly increases the amount of computation, which is the main reason the author had set the idea aside.
How should we understand the increase in computation? We know that RoPE achieves “relative position via absolute position,” which can only produce linearly growing relative positions, whereas the matrices above are nonlinear (or rather, piecewise linear). To implement them, the Attention matrix has to be computed twice and the results combined. Specifically, first compute the Attention matrix (before Softmax) once with standard RoPE:
$$ \begin{equation}a_{i,j}^{(1)} = \left(\boldsymbol{\mathcal{R}}^i\boldsymbol{q}_i\right)^{\top}\left(\boldsymbol{\mathcal{R}}^j\boldsymbol{k}_j\right) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{j-i}\boldsymbol{k}_j\end{equation} $$Here the first equality is the implementation method, and the second equality is the equivalent result. $\boldsymbol{\mathcal{R}}$ is the rotation matrix of RoPE. For simplicity, we omit the Attention scale factor. Next, we need to calculate the Attention matrix for RoPE with an interval of $1/k$ (Leaky ReRoPE):
$$ \begin{equation}a_{i,j}^{(2)} = \left(\boldsymbol{\mathcal{R}}^{(i-w)/k+w}\boldsymbol{q}_i\right)^{\top}\left(\boldsymbol{\mathcal{R}}^{j/k}\boldsymbol{k}_j\right) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{(j-i+w)/k-w}\boldsymbol{k}_j\end{equation} $$If it is ReRoPE, it is simpler:
$$ \begin{equation}a_{i,j}^{(2)} = \left(\boldsymbol{\mathcal{R}}^w\boldsymbol{q}_i\right)^{\top}\boldsymbol{k}_j = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}^{-w}\boldsymbol{k}_j\end{equation} $$Finally, based on the condition $i - j < w$, combine the two:
$$ \begin{equation}a_{i,j} = \left\{\begin{aligned} &a_{i,j}^{(1)},\quad (i - j < w) \\[8pt] &a_{i,j}^{(2)}, \quad (i - j \geq w) \end{aligned}\right.\end{equation} $$Whether it is ReRoPE or Leaky ReRoPE, the Attention matrix inevitably has to be computed twice (if there is a more efficient implementation, the author would be glad to hear about it), which is one source of extra computation. Furthermore, because the Attention matrix computation must be customized, existing flash attention implementations cannot be used directly, which further increases the cost in practice.
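Below is a minimal PyTorch sketch of this two-pass computation and merge. It is only an illustration, not the author's reference implementation (that is in the repository linked at the end of the article); `rotate(x, positions)` is a hypothetical helper that applies $\boldsymbol{\mathcal{R}}^p$ to each row of `x`, and single-head $(L, d)$ shapes are assumed:

```python
import torch

def rerope_scores(q, k, rotate, w):
    """Two-pass ReRoPE attention scores (illustrative sketch).

    q, k: (L, d) single-head Query/Key matrices before any rotation.
    rotate(x, positions): hypothetical helper applying the RoPE rotation R^p
    to each row of x.  The 1/sqrt(d) scale is omitted, as in the article.
    """
    L = q.shape[0]
    pos = torch.arange(L)

    # pass 1: standard RoPE, giving relative position j - i
    a1 = rotate(q, pos) @ rotate(k, pos).T

    # pass 2: ReRoPE, queries rotated by w and keys left unrotated,
    # giving the constant relative position -w
    a2 = rotate(q, torch.full((L,), w)) @ k.T

    # merge the two according to whether the distance is inside the window,
    # then apply the causal mask
    rel = pos[:, None] - pos[None, :]
    scores = torch.where(rel < w, a1, a2)
    return scores.masked_fill(rel < 0, float("-inf"))
```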
On the other hand, also because the relative positions are no longer linear, during autoregressive decoding the Key cache can only store the Keys before RoPE is applied; at each decoding step, the appropriate RoPE must then be applied to the entire cached Key sequence. This modification also increases inference cost. The only good news is that during token-by-token decoding, from the second step onward the Query sequence has length 1, so only the RoPE applied to the Key sequence needs to be customized and the Attention matrix needs to be computed only once:
$$ \begin{equation}a_{i,j} = \left\{\begin{aligned} &\boldsymbol{q}_i^{\top}\left(\boldsymbol{\mathcal{R}}^{\max(j-i,-w)}\boldsymbol{k}_j\right), \quad(\text{ReRoPE})\\[8pt] &\boldsymbol{q}_i^{\top}\left(\boldsymbol{\mathcal{R}}^{\max(j-i,(j-i+w)/k-w)}\boldsymbol{k}_j\right), \quad(\text{Leaky ReRoPE}) \end{aligned}\right.\end{equation} $$
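A corresponding sketch of the decode-time case, under the same assumptions (a hypothetical `rotate` helper and a Key cache that stores the Keys before RoPE), here for the ReRoPE branch:

```python
import torch

def rerope_decode_scores(q, k_cache, rotate, w):
    """Decode-time ReRoPE scores for a single new token (illustrative sketch).

    q: (1, d) Query of the current token; k_cache: (L, d) cached Keys stored
    *before* RoPE; rotate(x, positions): hypothetical row-wise R^p helper.
    """
    L = k_cache.shape[0]
    i = L - 1                                  # position of the current token
    j = torch.arange(L)                        # positions of the cached Keys
    rel = torch.clamp(j - i, min=-w)           # max(j - i, -w): ReRoPE's clipped positions
    # only the Keys are rotated, so a single (1, L) score row is enough
    return q @ rotate(k_cache, rel).T
```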
Experiments#
Continuing with the settings from “The Road to Transformer Upgrade: 11. Taking $\beta$-Radix Position to the End”, we conducted experiments on ReRoPE, and the results are shown in the table below:
Test Length | 512(Train) | 4096(Repeated) | 4096(Non-repeated) |
---|---|---|---|
Baseline | 49.41% | 24.17% | 23.16% |
Baseline-$\log n$ | 49.40% | 24.60% | 24.02% |
PI-RoPE | 49.41% | 15.04% | 13.54% |
PI-RoPE-$\log n$ | 49.40% | 14.99% | 16.51% |
NTK-RoPE-old | 49.41% | 51.28% | 39.27% |
NTK-RoPE-$\log n$-old | 49.40% | 61.71% | 43.75% |
NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
NTK-RoPE-$\log n^{\dagger}$-fixed | 49.41% | 55.94% | 41.11% |
NTK-RoPE-$\log n$-fixed | 49.40% | 62.85% | 44.14% |
NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
NTK-RoPE-$\log n^{\dagger}$-mixed | 49.41% | 59.11% | 42.38% |
NTK-RoPE-$\log n$-mixed | 49.40% | 68.91% | 45.41% |
ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
ReRoPE-w256-$\log n^{\dagger}$ | 49.41% | 82.40% | 48.85% |
ReRoPE-w256-$\log n$ | 49.40% | $\boldsymbol{85.12\%}$ | $\boldsymbol{49.07\%}$ |
HFWA | 48.70% | 80.84% | 48.15% |
As mentioned at the beginning of the article, the extrapolation effect of ReRoPE without fine-tuning is surprisingly good. It not only significantly surpasses the previous best NTK-RoPE-mixed but also clearly exceeds HFWA, which was pre-trained from scratch! Here, $\text{w256}$ refers to $w=256$, $\log n^{\dagger}$ means that $\log n$ scaling was not included during pre-training (like LLAMA), and during the test phase, each $\boldsymbol{q}_n$ is multiplied by $\max(1, \log_{\text{maxlen}} n)$, while $\log n$ means that the $\log n$ scaling factor was included during pre-training.
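For concreteness, the test-time $\log n$ factor described above can be sketched as follows, assuming `maxlen` is the training length (512 in these experiments) and positions are counted from $1$:

```python
import numpy as np

def logn_scale(positions, maxlen=512):
    """Test-time scaling factor max(1, log_maxlen(n)) applied to each query q_n."""
    n = np.maximum(np.asarray(positions), 1)          # guard against position 0
    return np.maximum(1.0, np.log(n) / np.log(maxlen))

# positions beyond the training length get a factor greater than 1
print(logn_scale([1, 256, 512, 1024, 4096]))
```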
Here are some ablation experiments, showing that ReRoPE is quite robust to $w$, with the optimal value being roughly $1/4 \sim 1/2$ of the training length:
Test Length | 512(Train) | 4096(Repeated) | 4096(Non-repeated) |
---|---|---|---|
ReRoPE-w64 | 49.41% | 69.39% | 45.19% |
ReRoPE-w64-$\log n^{\dagger}$ | 49.41% | 78.58% | 47.42% |
ReRoPE-w64-$\log n$ | 49.40% | 84.38% | 48.14% |
ReRoPE-w128 | 49.41% | 76.11% | 47.82% |
ReRoPE-w128-$\log n^{\dagger}$ | 49.41% | 82.28% | 48.78% |
ReRoPE-w128-$\log n$ | 49.40% | $\boldsymbol{85.47\%}$ | 48.87% |
ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
ReRoPE-w256-$\log n^{\dagger}$ | 49.41% | 82.40% | 48.85% |
ReRoPE-w256-$\log n$ | 49.40% | 85.12% | $\boldsymbol{49.07\%}$ |
ReRoPE-w384 | 49.41% | 70.72% | 48.15% |
ReRoPE-w384-$\log n^{\dagger}$ | 49.41% | 76.42% | 48.31% |
ReRoPE-w384-$\log n$ | 49.40% | 83.24% | 48.62% |
ReRoPE-w512 | 49.41% | 7.09% | 8.25% |
ReRoPE-w512-$\log n^{\dagger}$ | 49.41% | 7.08% | 8.25% |
ReRoPE-w512-$\log n$ | 49.40% | 15.84% | 10.83% |
The table below compares ReRoPE and Leaky ReRoPE:
Test Length | 512(Train) | 4096(Repeated) | 4096(Non-repeated) |
---|---|---|---|
ReRoPE-w128-$\log n$ | 49.40% | $\boldsymbol{85.47\%}$ | 48.87% |
Leaky ReRoPE-w128-k64-$\log n$ | 49.40% | 85.29% | 48.96% |
Leaky ReRoPE-w128-k32-$\log n$ | 49.40% | 85.31% | 49.03% |
Leaky ReRoPE-w128-k16-$\log n$ | 49.40% | 85.15% | $\boldsymbol{49.10\%}$ |
Leaky ReRoPE-w128-k8-$\log n$ | 49.40% | 80.00% | 48.11% |
ReRoPE-w256-$\log n$ | 49.40% | 85.12% | $\boldsymbol{49.07\%}$ |
Leaky ReRoPE-w256-k64-$\log n$ | 49.40% | 84.60% | 49.03% |
Leaky ReRoPE-w256-k32-$\log n$ | 49.40% | 84.30% | 48.97% |
Leaky ReRoPE-w256-k16-$\log n$ | 49.40% | 83.59% | 48.87% |
Leaky ReRoPE-w256-k8-$\log n$ | 49.40% | 69.80% | 45.72% |
As a generalization of ReRoPE, a carefully tuned Leaky ReRoPE has a chance to surpass ReRoPE, but the improvement is very slight. In addition, with a finite $k$ the maximum length that can be handled is also limited: since the total generation length cannot be known in advance, one can only preset a sufficiently large $k$, and once the input becomes long enough the position encodings will exceed the training length and performance will drop sharply. ReRoPE, in contrast, carries no such risk. Overall, the value of carefully tuning Leaky ReRoPE rather than simply using ReRoPE seems limited.
The above experimental results were all tested on a 100 million parameter GAU model. Below are the test results based on llama2-13b (the metric is loss, lower is better), representing the performance on a real LLM:
Test Length | 4096(Train) | 8192 | 16384 |
---|---|---|---|
RoPE | 1.4967 | 8.8615 | - |
NTK-RoPE | 1.6081 | 1.5417 | 1.5163 |
ReRoPE | 1.4996 | 1.4267 | 1.4001 |
As can be seen, ReRoPE truly achieved almost no loss in training performance (RoPE-4096 represents training performance) and satisfies the ideal characteristic of “longer context, lower loss” (more context should be more helpful for prediction). In addition, the author also tested the chat effect on the OpenBuddy open-source LLAMA2-13b fine-tuned model, and it felt quite good (tested with Context up to 20k tokens).
Finally, here is the author's code implementation of ReRoPE and Leaky ReRoPE based on the transformers LLAMA model; readers can load LLAMA-series models with it for their own tests:
Github: https://github.com/bojone/rerope
Summary#
In this article, the author proposes ReRoPE (Rectified RoPE), which is also a post-processing scheme for RoPE. Experimental results show that its zero-shot length extrapolation capability not only significantly surpasses the previous NTK-aware Scaled RoPE but also exceeds HFWA, which was specially designed and required training from scratch. Furthermore, unlike NTK-aware Scaled RoPE, whose performance drops sharply after a certain length, ReRoPE seems to perform well at any length. In addition to comparative experiments, the article also provides a reference implementation based on transformers-llama, which interested readers can test themselves.
@online{kexuefm-9708,
title={The Road to Transformer Upgrade: 12. Infinitely Extrapolatable ReRoPE?},
author={苏剑林},
year={2023},
month={08},
url={\url{https://kexue.fm/archives/9708}},
}