
13. Inverse Use of Leaky ReRoPE

Road to a better Transformer - This article is part of a series.
Part 13: This Article
This is a gemini-2.5-flash-preview-04-17 translation of a Chinese article. Beware of potential errors.

Last week, in 《Transformer Upgrade Path: 12. Infinite Extrapolation with ReRoPE?》, I proposed ReRoPE and Leaky ReRoPE. Numerous experimental results showed that they can extend the context length of LLMs without fine-tuning and with almost no loss of performance at the training length, achieving the ideal characteristic of “longer context, lower loss”. Furthermore, unlike NTK-aware Scaled RoPE, ReRoPE also seems capable of processing arbitrarily long contexts.

Overall, ReRoPE looks quite satisfactory; its minor drawback is higher inference cost: the first step of inference requires two Attention computations, and every subsequent step must recompute the positional encodings. This article attempts to solve this problem by using Leaky ReRoPE in reverse during training.

Review

At the risk of repetition, let’s review: RoPE is formally an absolute positional encoding, but the effect it actually achieves is that of a relative positional encoding, with the corresponding relative position matrix being:

$$ \begin{pmatrix}0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ 3 & 2 & 1 & 0 & \\ \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 2} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 1} & \small{L - 2} & \ddots & \ddots & \ddots & 3 & 2 & 1 & 0 & \\ \end{pmatrix} $$
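To make this matrix concrete, here is a minimal NumPy sketch of my own (not code from the original post; the function name `rope_positions` is just for illustration) that builds the causal relative position map $p_{ij} = i - j$:

```python
# Minimal illustrative sketch (not the author's code): the standard RoPE
# relative position matrix p[i, j] = i - j, keeping only the causal part.
import numpy as np

def rope_positions(L):
    """Return the L x L relative position matrix used by RoPE (lower triangle)."""
    i = np.arange(L)[:, None]   # query positions
    j = np.arange(L)[None, :]   # key positions
    p = i - j                   # relative positions i - j
    return np.tril(p)           # zero out the non-causal upper triangle

print(rope_positions(6))
```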

To preserve locality while avoiding position out-of-bounds issues caused by Long Context, Leaky ReRoPE changes the relative position matrix during the inference phase to:

$$ \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\small{w + \frac{1}{k}}} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\small{w + \frac{2}{k}}} & \color{green}{\small{w + \frac{1}{k}}} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\small{w + \frac{2}{k}}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\small{w + \frac{L-1-w}{k}}} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\small{w + \frac{2}{k}}} & \color{green}{\small{w + \frac{1}{k}}} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} $$

where $w$ is the window width, roughly $\frac{1}{4}$ to $\frac{1}{2}$ of the training length, and $k$ is used to adjust the maximum processable length, generally chosen such that $w + \frac{L-1-w}{k}$ does not exceed half of the training length. As for ReRoPE, it directly takes the limit $k\to\infty$:

$$ \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{w} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{green}{w} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{2} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} $$
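The two matrices above are easy to reproduce programmatically. Below is a small NumPy sketch of my own (function names are hypothetical, not the author's implementation) of the inference-time position map of Leaky ReRoPE and its $k \to \infty$ limit, ReRoPE:

```python
# Illustrative sketch (not the author's code): inference-time relative position
# maps of Leaky ReRoPE (step 1 inside the window, step 1/k outside) and ReRoPE
# (positions clipped at w, i.e. the k -> infinity limit).
import numpy as np

def leaky_rerope_positions(L, w, k):
    """Step size 1 inside the window w, step size 1/k beyond it."""
    p = np.arange(L)[:, None] - np.arange(L)[None, :]   # plain RoPE positions i - j
    p_leaky = np.where(p < w, p, w + (p - w) / k)       # compress positions beyond w
    return np.tril(p_leaky)

def rerope_positions(L, w):
    """ReRoPE: all relative positions beyond the window are clipped to w."""
    p = np.arange(L)[:, None] - np.arange(L)[None, :]
    return np.tril(np.minimum(p, w))

print(leaky_rerope_positions(8, w=3, k=2))
print(rerope_positions(8, w=3))
```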

Inversion

Based on the evaluation results from the previous post, ReRoPE and Leaky ReRoPE are quite satisfactory as fine-tuning-free extrapolation schemes: they lose no performance within the training length while still achieving “longer context, lower loss”. The only minor drawback is that their inference is slower than the original Attention, and they are not yet compatible with acceleration techniques such as Flash Attention.

So, can we do the opposite? ReRoPE/Leaky ReRoPE runs at normal RoPE speed during training and is slower at inference. Conversely, can we make the training phase slower and have the inference phase revert to conventional RoPE? Some readers might wonder: why would we want to make training slower? Isn’t training cost higher? The point is that ReRoPE/Leaky ReRoPE is a length extrapolation method, intended for “Train Short, Test Long”: the slower training is a one-off, controllable cost, whereas slower inference is a long-term burden that is hard to bear. So, if the slowdown is of a similar magnitude, we would rather put the slow part in the training phase.

Let’s look at Leaky ReRoPE again. Its relative position matrix during the training phase is the standard RoPE matrix above, with a step size of 1 everywhere, while during the inference phase it uses a step size of 1 within the window $w$ and a step size of $\frac{1}{k} < 1$ outside the window, as in the Leaky ReRoPE matrix shown earlier. In other words, the difference is that a smaller step size is used outside the window at inference time. What if we reverse this: use Leaky ReRoPE during the training phase with a step size greater than 1 outside the window? Then, by the same principle of “using a smaller step size outside the window at inference time”, can the inference phase use a step size of exactly 1 outside the window, thereby degrading back to RoPE?
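Concretely, the idea is to swap which phase gets the modified positions. Here is a minimal NumPy sketch of my own (not the author's implementation; in practice the training phase still needs the two-attention-score merging trick from the previous post) of the two position maps:

```python
# Illustrative sketch (my own, not the author's code) of the "inverse" idea:
# train with Leaky ReRoPE whose step OUTSIDE the window is 1/k > 1, then use
# plain RoPE positions (step 1 everywhere) at inference time.
import numpy as np

def invleaky_train_positions(L, w, k):
    """Training-phase positions: step 1 inside the window, step 1/k (> 1) beyond it."""
    p = np.arange(L)[:, None] - np.arange(L)[None, :]
    return np.tril(np.where(p < w, p, w + (p - w) / k))

def inference_positions(L):
    """Inference phase degrades back to ordinary RoPE: p[i, j] = i - j."""
    p = np.arange(L)[:, None] - np.arange(L)[None, :]
    return np.tril(p)

# e.g. the setting used later in this post: w = 128, k = 1/16 (step 16 outside the window)
print(invleaky_train_positions(512, w=128, k=1/16)[-1, :4])   # large training-phase positions
print(inference_positions(4096)[-1, :4])                      # plain RoPE at inference
```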

I call this idea “InvLeaky ReRoPE (Inverse Leaky ReRoPE)”. Without further ado, let’s conduct experiments immediately.

Experiments

Continuing the previous “GAU + Deep Norm + Tiger + Language Model” experimental setup, we use Leaky ReRoPE with $k=1/16$ and $w=128$ during the training phase, and normal RoPE during the inference phase. The test results are as follows:

| Test Length | 512 (Training) | 4096 (Repeated) | 4096 (Non-repeated) |
| --- | --- | --- | --- |
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-$\log n$ | 49.40% | 24.60% | 24.02% |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-$\log n^{\color{red}{\dagger}}$-fixed | 49.41% | 55.94% | 41.11% |
| NTK-RoPE-$\log n$-fixed | 49.40% | 62.85% | 44.14% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-$\log n^{\color{red}{\dagger}}$-mixed | 49.41% | 59.11% | 42.38% |
| NTK-RoPE-$\log n$-mixed | 49.40% | 68.91% | 45.41% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-$\log n^{\color{red}{\dagger}}$ | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-$\log n$ | 49.40% | $\boldsymbol{85.12\%}$ | $\boldsymbol{49.07\%}$ |
| InvLeaky ReRoPE-w128-$\log n$ | 49.38% | 82.25% | 48.32% |
| InvLeaky ReRoPE-w128-b8-$\log n$ | 49.62% | 81.15% | 48.85% |
| HFWA | 48.70% | 80.84% | 48.15% |

Here, $\text{b8}$ means the RoPE frequency base was changed from 10000 to 80000. As can be seen, the “Leaky ReRoPE → RoPE” scheme, InvLeaky ReRoPE, while not as effective as “RoPE → ReRoPE/Leaky ReRoPE”, still outperforms HFWA. Moreover, because the inference phase is standard RoPE, it can make use of existing acceleration techniques, so it remains quite competitive. In addition, I did some simple tuning of the parameters $k, w, b$ and found that the optimal choices are essentially the two combinations above, which can be summarized as: set $k$ to the reciprocal of twice the extension factor, set $w$ to $\frac{1}{4}$ of the training length, and optionally multiply $b$ by the extension factor.
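As a quick illustration of this rule of thumb, here is a small helper of my own (the function name and defaults are hypothetical, not from the post) that reproduces the two settings in the table:

```python
# Sketch of the rule of thumb described above (names are mine):
# k = 1 / (2 * extension factor), w = train_len / 4, base optionally * extension factor.
def invleaky_hyperparams(train_len, target_len, base=10000, scale_base=False):
    factor = target_len / train_len            # extension factor, e.g. 4096 / 512 = 8
    k = 1 / (2 * factor)                       # e.g. 1/16
    w = train_len // 4                         # e.g. 128
    b = base * factor if scale_base else base  # e.g. 80000 for the "b8" row
    return k, w, b

print(invleaky_hyperparams(512, 4096))                   # (0.0625, 128, 10000)
print(invleaky_hyperparams(512, 4096, scale_base=True))  # (0.0625, 128, 80000.0)
```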

So, how much does InvLeaky ReRoPE slow down training? In the above experiments the model has 100 million parameters and the training length is 512; the training time per 1000 steps increased from 330 seconds to 350 seconds, an increase of less than 10%. Of course, this is partly thanks to GAU: GAU uses single-head attention and is inherently faster than multi-head attention. For multi-head attention or longer training lengths the increase is expected to be larger, but I roughly estimate it would stay within 50%, which is acceptable.

Summary

This article proposes the “inverse use” of Leaky ReRoPE. By using Leaky ReRoPE with a larger step size during the training phase, the inference phase can revert to conventional RoPE, thereby maintaining the inference speed. Experimental results show that this approach is still somewhat competitive.

@online{kexuefm-9728,
        title={Transformer Upgrade Path: 13. Inverse Use of Leaky ReRoPE},
        author={苏剑林},
        year={2023},
        month={08},
        url={\url{https://kexue.fm/archives/9728}},
}