In the previous article, “The Road to Transformer Upgrade: 13. Inverse Use of Leaky ReRoPE”, the author tried inverting the idea of Leaky ReRoPE: apply it during the training phase so that the position encoding becomes ordinary RoPE at inference, thereby achieving length extrapolation while avoiding ReRoPE’s inference slowdown. Unfortunately, the experiments showed that “Leaky ReRoPE → RoPE” does not work as well as “RoPE → ReRoPE/Leaky ReRoPE”, so the problem has not been fully resolved.
At this point, the author recalled that the HWFA proposed earlier in “The Road to Transformer Upgrade: 9. A New Approach to Global Length Extrapolation” already has some length extrapolation capability of its own. If it joined forces with ReRoPE, would the results be even better? More importantly, adding HWFA can significantly reduce inference costs, thereby making up for ReRoPE’s main shortcoming!
Warm-up#
First, let’s review HWFA as usual. HWFA (Hybrid Window-Full Attention) is not a specific model but a way of combining Attention layers. It can give an Attention model strong length extrapolation capability while essentially preserving its performance, and it also reduces training and inference costs.
Specifically, HWFA is “$L-1$ layers of Window RoPE Attention + $1$ layer of Full NoPE Attention”: the first $L-1$ Attention layers all use RoPE and restrict the receptive field with a window, so their inference cost is constant, and with block-parallel optimization training can also be sped up. The last Attention layer remains global but drops the position encoding (NoPE) and adds $\log n$ scaling. With these modifications and a suitable window size, the model’s training performance drops only slightly, while it exhibits excellent length extrapolation capability.
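To make the layout concrete, here is a minimal sketch of the HWFA arrangement: a causal window mask with RoPE for the first $L-1$ layers, and a full causal mask without position encoding (plus $\log n$ scaling) for the last layer. The helper names and the dictionary fields are hypothetical illustrations, not the author’s implementation.

```python
import numpy as np

def causal_window_mask(seq_len, window):
    """Causal mask that also limits the receptive field: each query
    attends to itself and at most window-1 preceding positions."""
    i = np.arange(seq_len)[:, None]  # query index
    j = np.arange(seq_len)[None, :]  # key index
    return (j <= i) & (i - j < window)

def hwfa_layer_specs(num_layers, window):
    """HWFA layout: L-1 layers of Window RoPE Attention followed by
    1 layer of Full NoPE Attention with log-n scaling."""
    window_layer = {"mask": "causal_window", "window": window,
                    "pos_encoding": "rope", "logn_scale": False}
    full_layer = {"mask": "causal_full", "window": None,
                  "pos_encoding": "none", "logn_scale": True}
    return [dict(window_layer) for _ in range(num_layers - 1)] + [full_layer]
```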
Coincidentally, Google later proposed FOT (Focused Transformer), which shares many similarities with HWFA: it also consists of $L-1$ layers of Local Attention plus $1$ layer of Full Attention, and the Full Attention is also NoPE. The difference is that FOT places the Full Attention in the middle, and the Local Attention does not strictly limit the receptive field, so it cannot directly extrapolate length. Therefore, it proposed crossbatch training to extend the model length. Afterwards, the author experimented with using crossbatch training on HWFA, and it also had good results.
New Knowledge#
Returning to the topic of this article, how can HWFA and ReRoPE “join forces”? We know that ReRoPE is used on Full RoPE Attention, by truncating the relative position matrix during the inference phase:
$$ \begin{pmatrix}0 & \\ 1 & 0 & \\ 2 & 1 & 0 &\\ \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 2} & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \\ \small{L - 1} & \small{L - 2} & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \end{pmatrix} \,\to\, \begin{pmatrix} \color{red}{0} & \\ \color{red}{1} & \color{red}{0} & \\ \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{w} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \color{red}{\ddots} & \\ \color{green}{w} & \color{green}{\ddots} & \color{green}{\ddots} & \color{green}{w} & \color{red}{\small{w - 1}} & \color{red}{\ddots} & \color{red}{1} & \color{red}{0} & \\ \end{pmatrix} $$

Surprisingly, this simple post-processing yields excellent length extrapolation. However, because of RoPE’s special structure, the original ReRoPE implementation has to compute the Attention matrix twice and is not compatible with mainstream accelerations such as Flash Attention, so overall the extra cost at inference is still noticeable.
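The truncated matrix above is easy to reproduce; below is a minimal numpy sketch (illustrative only, not the author’s two-pass attention kernel).

```python
import numpy as np

def rerope_relative_positions(seq_len, w):
    """Relative position matrix used by ReRoPE at inference: within the
    window the offsets are the usual RoPE values 0..w-1 (red region),
    and everything farther away is truncated to w (green region)."""
    i = np.arange(seq_len)[:, None]  # query index
    j = np.arange(seq_len)[None, :]  # key index
    rel = np.minimum(i - j, w)       # truncate offsets larger than w
    return np.where(j <= i, rel, 0)  # keep only the causal (lower-triangular) part

# e.g. rerope_relative_positions(8, 4) reproduces the right-hand matrix above with w = 4.
```

The truncation itself is trivial; the cost comes from the fact that scores under two different position patterns have to be produced, which is what forces the double computation of the Attention matrix mentioned above.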
However, adding HWFA greatly alleviates this problem! As noted above, ReRoPE only applies to Full RoPE Attention, while HWFA consists mostly of Window RoPE Attention. The “HWFA+ReRoPE” scheme is therefore readily apparent: during training, replace HWFA’s original Full NoPE Attention with Full RoPE Attention, then switch it to Full ReRoPE Attention at inference. This way the extra cost of switching to ReRoPE at inference is very small, and the savings from the remaining Window Attention layers become all the more significant.
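In the same spirit as the earlier sketch, the “HWFA+ReRoPE” layout can be written down as follows; again the function, the `full_layers` argument, and the field names are hypothetical illustrations rather than the author’s code.

```python
def hwfa2_layer_specs(num_layers, window, full_layers, inference=False):
    """HWFA+ReRoPE layout: Window RoPE Attention everywhere except the
    layers listed in full_layers, which use Full RoPE Attention during
    training and switch to Full ReRoPE Attention at inference."""
    specs = []
    for layer in range(num_layers):
        if layer in full_layers:
            specs.append({"mask": "causal_full", "window": None,
                          "pos_encoding": "rerope" if inference else "rope"})
        else:
            specs.append({"mask": "causal_window", "window": window,
                          "pos_encoding": "rope"})
    return specs

# Illustrative usage: hwfa2_layer_specs(24, 64, full_layers=[12]) for training,
# and the same call with inference=True at inference time. Only the few full
# layers change between the two phases, which is why the extra ReRoPE cost stays small.
```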
In addition, “HWFA+ReRoPE” can also compensate for the performance loss of the original HWFA. Previously, to ensure length extrapolation capability, the Full Attention in HWFA had to remove position encoding (i.e., NoPE), and the receptive field $\tilde{w}$ of Window Attention had to satisfy $(\tilde{w}-1)(L-1)+1 = \alpha N$ (where $L$ is the number of layers, $N$ is the training length, and $0 < \alpha \leq 1$). These constraints limited the model’s expressive power, leading to reduced training performance. With the introduction of ReRoPE, the receptive field of Window Attention can be appropriately increased, Full Attention can also use RoPE, and it can be placed in the middle layers instead of just the last layer, or even have more than $1$ layer of Full Attention. These changes can compensate for performance loss, and thanks to ReRoPE, the length extrapolation capability will not decrease.
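For a sense of how restrictive the old constraint was, a quick check using the experimental settings that appear below ($N = 512$, $L = 24$) gives

$$\tilde{w} \,=\, \frac{\alpha N - 1}{L - 1} + 1 \,\leq\, \frac{512 - 1}{23} + 1 \,\approx\, 23.2,$$

so the original HWFA could not push $\tilde{w}$ beyond about 23 even with $\alpha = 1$; in practice it used at most $\tilde{w} = 16$ (i.e. $\alpha \approx 0.68$).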
To distinguish from the initial version of HWFA, the combination of “HWFA+ReRoPE” can also be called “HWFA2”.
Experiments#
Below are some experimental results for “HWFA+ReRoPE (HWFA2)”. Since the introduction of ReRoPE gives HWFA much more flexibility, the comparison only involves combinations that the author considers more intuitive, and cannot fully verify all permutations and combinations.
The experimental setup is the same as in the earlier HWFA and ReRoPE experiments: a GAU model with about 100 million parameters, trained with a length of 512. Note that there are two window parameters here. One is ReRoPE’s own $w$; previous ReRoPE experiments showed that it has little impact, so it is fixed at 256 below. The other is the receptive field of HWFA’s Window Attention, denoted $\tilde{w}$, which is tunable. The main hyperparameters of “HWFA+ReRoPE” are therefore the Window Attention receptive field $\tilde{w}$ and the number and placement of Full Attention layers. The author’s earlier comparative experiments showed that, in terms of training performance, placing Full Attention in the middle is better than placing it at the end. So with 1 layer of Full Attention, its default position is the layer at index = num_layers / 2; with 2 layers of Full Attention, the defaults are index = num_layers / 3 and index = 2 * num_layers / 3, and so on.
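As a small illustration of the placement rule just described (a hypothetical helper, not the author’s code), evenly spacing the Full Attention layers can be written as:

```python
def default_full_layer_indices(num_layers, num_full):
    """Evenly spaced Full Attention layers: 1 full layer sits at
    num_layers // 2, 2 full layers at num_layers // 3 and
    2 * num_layers // 3, and so on."""
    return [(k + 1) * num_layers // (num_full + 1) for k in range(num_full)]

# With the 24-layer model used here: default_full_layer_indices(24, 1) -> [12],
# default_full_layer_indices(24, 2) -> [8, 16]
```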
Some experimental results are as follows:
| Test Length | 512 (Train) | 4096 (Repeated) | 4096 (Non-repeated) |
|---|---|---|---|
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-$\log n$ | 49.40% | 24.60% | 24.02% |
| ReRoPE-w256 | 49.41% | 77.90% | 48.48% |
| ReRoPE-w256-$\log n^{\dagger}$ | 49.41% | 82.40% | 48.85% |
| ReRoPE-w256-$\log n$ | 49.40% | 85.12% | 49.07% |
| InvLeaky ReRoPE-w128-$\log n$ | 49.38% | 82.25% | 48.32% |
| InvLeaky ReRoPE-w128-b8-$\log n$ | 49.62% | 81.15% | 48.85% |
| HWFA | 48.70% | 80.84% | 48.15% |
| HWFA-ReRoPE-w32-f1 | 49.29% | 83.13% | 49.34% |
| HWFA-ReRoPE-w64-f1 | 49.32% | 82.41% | 49.37% |
| HWFA-ReRoPE-w128-f1 | 49.21% | 80.18% | 48.99% |
| HWFA-ReRoPE-w256-f1 | 49.00% | 54.94% | 47.64% |
| HWFA-ReRoPE-w32-f2 | 49.50% | 84.09% | 49.35% |
| HWFA-ReRoPE-w64-f2 | 49.46% | 84.43% | 49.36% |
| HWFA-ReRoPE-w128-f2 | 49.35% | 83.09% | 48.97% |
| HWFA-ReRoPE-w256-f2 | 49.37% | 75.24% | 48.42% |
In the table above, the number after $\text{w}$ is the Window Attention receptive field $\tilde{w}$, and the number after $\text{f}$ is the number of Full Attention layers. The original HWFA, due to the constraints discussed earlier, only used $\tilde{w}$ up to 16, since larger values would noticeably hurt length extrapolation. As the table shows, increasing $\tilde{w}$ quickly brings the training performance in line with the baseline, and adding more Full Attention layers even pushes it past the baseline. As for extrapolation, both the $\text{w32}$ and $\text{w64}$ settings do quite well, clearly exceeding HWFA. Overall, the best HWFA-ReRoPE combination is $\text{w64-f2}$: both its training performance and its non-repeated extrapolation performance exceed the original ReRoPE. Considering that the training length $N$ is 512 and the number of layers $L$ is 24, the author speculates that the optimal $\tilde{w}$ is around $2\sim 4$ times $N/L$ (here $N/L = 512/24 \approx 21$, so roughly 43 to 85, which brackets the best-performing $\text{w64}$).
Summary#
This article proposes a combined usage method for HWFA and ReRoPE. Small-scale experimental results show that this combination can achieve near-optimal length extrapolation without losing training performance. Furthermore, thanks to the design of HWFA, it can significantly reduce inference costs, effectively alleviating the drawback of increased inference cost in the original ReRoPE.
@online{kexuefm-9731,
title={The Road to Transformer Upgrade: 14. When HWFA Meets ReRoPE},
author={苏剑林},
year={2023},
month={08},
url={\url{https://kexue.fm/archives/9731}},
}