
9. A New Approach to Global Length Extrapolation

Road to a better Transformer - This article is part of a series.
Part 9: This Article
This is a gemini-2.5-flash-preview-04-17 translation of a Chinese article. Beware of potential errors.

Speaking of the reasons why Transformers cannot handle ultra-long sequences, the first reaction is usually the quadratic complexity of Self Attention. But in fact, even setting computational cost aside, conventional Transformers still cannot handle ultra-long sequences, because their length extrapolation ability is poor: when the input sequence significantly exceeds the training length, the model's performance usually degrades severely.

Although there has been some related work, the length extrapolation problem is still far from being practically solved. This article introduces a reference solution conceived by the author, which may be the only length extrapolation method that can be used in generative models and has global dependency capabilities.

Method Review

Length Extrapolation, also known as Length Generalization, was discussed in our previous articles “The Road to Upgrading Transformer: 7. Length Extrapolation and Local Attention” and “The Road to Upgrading Transformer: 8. Length Extrapolation and Position Robustness”. However, the methods covered there all have their own problems.

The various schemes introduced in the first article are all ideas that localize attention. Although they improve the metrics, they are essentially just making the numbers look a bit better and cannot achieve extrapolation with global dependency, so they are of no practical help in scenarios that truly require long-range dependency (such as In Context Learning). The second article enhances robustness to positional signals through random positional perturbations, which in principle can preserve global dependency, but that approach is only suitable for Encoder models, not for autoregressive generative models like GPT.

Therefore, the length extrapolation problem is still an urgent but unresolved issue for Transformers. In fact, this problem exists not only in Transformers. As we introduced in “Google’s New Work Attempts to ‘Revive’ RNN: Can RNN Be Glorious Again?”, the linear RNN models (including the popular RWKV) also do not have good length extrapolation capabilities. In the current LLM era, length extrapolation capability is particularly important because we always hope that the model can handle arbitrarily long text, but it is impossible to stretch the length of training samples to arbitrary length.

Translation Invariance

Next, we will focus on autoregressive Transformers, but the method is also effective for bidirectional attention Encoders. Essentially, localizing attention endows the entire model with “translation invariance” by restricting the attention’s perception range. A simple benchmark for translation invariance is Window Attention, as shown in the figure below:

[Figure: Window Attention]

[Figure: Stacked Receptive Field Diagram]

Assume the model contains $L$ layers of stacked Window Attention with a window size of $w$. Then the maximum receptive field of each token in the last layer is $(w-1)L+1$. Therefore, assuming the training length is $N$, under the constraint $(w-1)L+1 = \alpha N\,(0 < \alpha \leq 1)$ the model can gain a certain degree of translation invariance, because its maximum receptive field never exceeds the training length $N$, so the model's full receptive field is sufficiently trained. The smaller $\alpha$ is, the better the translation invariance usually is.
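
As a quick numerical sanity check of the $(w-1)L+1$ formula, here is a minimal numpy sketch (with hypothetical sizes, not the ones used later) that measures the stacked receptive field by propagating a reachability matrix:

```python
import numpy as np

def stacked_receptive_field(num_layers, window, seq_len):
    """Empirically measure the receptive field of stacked causal Window
    Attention by propagating a boolean reachability matrix."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # one layer: each token attends to itself and its (window - 1) predecessors
    one_layer = (j <= i) & (j > i - window)
    reach = np.eye(seq_len, dtype=bool)
    for _ in range(num_layers):
        reach = (reach.astype(int) @ one_layer.astype(int)) > 0
    # receptive field of the last token = number of positions it can see
    return int(reach[-1].sum())

L, w = 6, 8  # hypothetical sizes
print(stacked_receptive_field(L, w, seq_len=128), (w - 1) * L + 1)  # 43 43
```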

However, while this ensures translation invariance, it brings other problems. The most serious one is that, since the receptive field of each layer is limited to $w$, the capability of the attention mechanism is greatly weakened, leading to training results inferior to conventional attention (hereinafter referred to as Full Attention). Furthermore, our expectation for length extrapolation is not just “translation invariance” but “translation betterness”: performance should keep improving as the sequence grows (for example, in In Context Learning scenarios, the more examples are given, the better the performance should be). Therefore, the model should also be able to capture global dependencies.

Global Dependency

To this end, the author thought: The results obtained by Window Attention are essentially a kind of $n$-gram feature, except that $n$ becomes relatively large with multiple layers stacked. A single layer of Full Attention can be regarded as a kind of “retrieval” (as can be seen from the names query, key, value) and “fusion”. Its pattern is relatively easy to analyze. Previously, in “Understanding Attention’s Scale Operation from Entropy Invariance”, we concluded that single-layer (full) attention can enhance length extrapolation by adding a $\log n$ scaling factor.
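
For reference, here is a minimal numpy sketch of that $\log n$ scaling (a hedged reading of the entropy-invariance result: one common implementation multiplies the usual $1/\sqrt{d}$ scale by $\log n/\log m$, where $m$ is the training length, so the two coincide at $n=m$; all names here are illustrative):

```python
import numpy as np

def logn_scaled_scores(q, k, train_len):
    """Causal attention logits with a log n scaling factor.

    q, k: arrays of shape (seq_len, head_dim). Here n is taken to be the
    number of tokens visible to each query under the causal mask, and the
    factor log(n) / log(train_len) reduces to 1 at the training length.
    """
    seq_len, d = q.shape
    n = np.arange(1, seq_len + 1)[:, None]              # tokens visible per query
    scale = np.log(n) / np.log(train_len) / np.sqrt(d)  # per-row scale factor
    scores = (q @ k.T) * scale
    causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(causal, -np.inf, scores)
```

A softmax over these logits then gives the attention weights; at the training length this reduces to standard scaled dot-product attention.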

Therefore, the author had an idea:

If the first $L-1$ layers obtain $n$-gram features through Window Attention, can the last layer be replaced with Full Attention with a $\log n$ factor to retrieve and integrate these features, so as to compensate for the performance gap and gain global dependency capabilities?

To this end, we propose the following combination of attention mechanisms (Hybrid Window-Full Attention, or HWFA for short):

  1. The first $L-1$ layers use “Window Attention+RoPE” with a Window size $w$, satisfying the constraint $(w-1)(L-1)+1 = \alpha N$, where $N$ is the training length. To balance training performance and extrapolation performance, it is recommended to choose the largest possible $w$ under the premise of $\alpha\leq 3/4$;

  2. The $L$-th layer uses Full Attention with a $\log n$ factor, but does not use RoPE.

The reason for using RoPE in the preceding layers is that numerous experimental results have shown that RoPE helps enhance model performance (at least for base- and large-scale models). RoPE is not used in the last layer because RoPE at positions beyond the training length has never been trained, which would hurt length extrapolation. In fact, the RoPE in the first $L-1$ layers is already sufficient to supply the model with positional information, so omitting RoPE in the last layer has essentially no effect on training performance.
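
To make the recipe concrete, here is a minimal sketch of the two masks and the layer plan (illustrative only: RoPE, the $\log n$ factor itself, and the actual GAU/attention blocks are omitted, and the helper names are hypothetical):

```python
import numpy as np

def causal_window_mask(seq_len, window):
    """True where attention is allowed: each query sees itself
    and at most (window - 1) preceding tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def causal_full_mask(seq_len):
    """Standard causal mask for the final Full Attention layer."""
    j = np.arange(seq_len)
    return j[None, :] <= j[:, None]

def hwfa_layer_plan(num_layers, train_len, window, alpha=0.75):
    """First L-1 layers: Window Attention + RoPE; last layer:
    Full Attention with the log n factor and without RoPE."""
    assert (window - 1) * (num_layers - 1) + 1 <= alpha * train_len
    plan = [dict(attn="window", window=window, rope=True, logn=False)
            for _ in range(num_layers - 1)]
    plan.append(dict(attn="full", window=None, rope=False, logn=True))
    return plan

# The sizes used in the experiments below: 24 layers, training length 512, w = 16
plan = hwfa_layer_plan(num_layers=24, train_len=512, window=16)
print(plan[0], plan[-1])
```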

Experimental Results

Obviously, HWFA is a combination of attention mechanisms, which can be used in standard multi-head attention or in attention variants like GAU. The author conducted experiments based on GAU_alpha: training length 512, 24 layers of GAU, the first 23 layers use Window Attention with a window size $w=16$. The metric tested is token-by-token accuracy. The Baseline is all layers using Full Attention+RoPE (i.e., the standard default usage).

The results are very encouraging:

$$ \begin{array}{c|cc} \hline \text{Test Length} & 512 & 4096 \\ \hline \text{Baseline} & 49.41\% & 24.17\% \\ \text{HWFA} & 48.70\% & 80.84\% \\ \hline \end{array} $$

512 represents training-length accuracy (also called interpolation accuracy), and 4096 represents extrapolation accuracy. Why is the training accuracy only in the 40s, while the extrapolation accuracy reaches such a striking 80%+? This is because when constructing the test samples, the author included some repeated samples: a text segment no longer than 4096 tokens is repeated and concatenated until it reaches length 4096. Since the latter part of such a sample is a verbatim repeat of the earlier part, accuracy on that part is very high (the correct answer has already appeared earlier in the context). This shows that, as hoped, length extrapolation with this design does not sacrifice global dependency.
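
As a rough sketch of how such repeated test samples can be constructed (the exact construction used by the author is not specified beyond the description above; this is just one straightforward reading, with illustrative names):

```python
def make_repeated_sample(tokens, target_len=4096):
    """Repeat a token segment (no longer than target_len) until the
    concatenation reaches target_len, then truncate. The latter part of
    the result is a verbatim repeat of the earlier part, so a model with
    genuine global dependency can predict it almost perfectly."""
    assert 0 < len(tokens) <= target_len
    reps = -(-target_len // len(tokens))   # ceiling division
    return (tokens * reps)[:target_len]

# Toy example with integer "tokens"
sample = make_repeated_sample(list(range(1500)), target_len=4096)
print(len(sample), sample[:3], sample[1500:1503])  # 4096 [0, 1, 2] [0, 1, 2]
```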

If the repeating samples are removed and only normal natural text samples are kept, the results are still acceptable:

$$ \begin{array}{c|cc} \hline \text{Test Length} & 512 & 4096 \\ \hline \text{Baseline} & 49.41\% & 23.16\% \\ \text{HWFA} & 48.70\% & 48.15\% \\ \hline \end{array} $$

To further verify the global dependency capability, the author also did the even pairs task from “The Road to Upgrading Transformer: 8. Length Extrapolation and Position Robustness” (determining if the first and last characters are the same). The method in this article achieved 100% extrapolation accuracy, which also shows that the model can learn global dependencies (attention needs to span the entire sequence to accurately determine if they are the same).
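
For completeness, a toy generator for this kind of probe (a hedged reconstruction from the description “determine whether the first and last characters are the same”; the data format in the referenced article may differ):

```python
import random

def even_pairs_example(length, vocab=("a", "b")):
    """One sequence plus a binary label: 1 if the first and last characters
    are equal, else 0. Answering correctly requires attention that spans
    the entire sequence, so it probes global dependency under extrapolation."""
    seq = [random.choice(vocab) for _ in range(length)]
    return "".join(seq), int(seq[0] == seq[-1])

random.seed(0)
print(even_pairs_example(16))        # train-length-style example
print(even_pairs_example(4096)[1])   # label of an extrapolation-length example
```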

The author also conducted some ablation studies, and the results are as follows:

  1. Window Attention without RoPE, both interpolation and extrapolation performance decrease;

  2. Full Attention with RoPE, extrapolation performance decreases;

  3. Full Attention without the $\log n$ factor, extrapolation performance decreases;

  4. Using only Window Attention, both interpolation and extrapolation performance decrease;

  5. Changing to $L-2$ layers of Window Attention + 2 layers of Full Attention, extrapolation performance decreases;

  6. With $w=32$ (at this point $(w-1)(L-1) > N$), extrapolation performance decreases.

Comparative Analysis

Some readers might ask: why is there no comparison with other methods? The reason is probably not what you would expect: when the author tried some of the methods from “The Road to Upgrading Transformer: 7. Length Extrapolation and Local Attention” on GAU, they all failed (their extrapolation ability was very poor)!

Why is this the case? The author's first reaction was that those related works experimented with standard multi-head attention, whereas these experiments used GAU. From the perspective of the attention mechanism, the biggest feature of GAU is that it is single-headed (unlike the original GAU, the version used here also applies softmax normalization). So the author initially attributed the gap to the difference between multi-head and single-head: the parameter designs of schemes like ALiBi, Sandwich, and XPOS were indeed tailored to multi-head attention, and their effectiveness in the single-head setting remains to be verified.

However, after further checks, the author found that the difference between single-head and multi-head did not affect length extrapolation as much as imagined, so there had to be another reason. Only a few days ago did the author realize another important difference: these experiments have always used the Post Norm architecture, while mainstream work has moved to Pre Norm. In “Why is Pre Norm's performance not as good as Post Norm?”, we analyzed that the nominal depth of Pre Norm is somewhat “watered down”, i.e., its effective depth is smaller. So when a localization restriction is applied to each Attention layer, the features produced under Pre Norm are actually more localized, and hence the extrapolation effect is better.

Therefore, judging from the current results, if the author insists on the combination of GAU + Post Norm, the method in this article seems to be the only workable route to length extrapolation. The guarantee comes from “translation invariance” plus “independent and identically distributed” features: because the total receptive field of the first $L-1$ Window Attention layers does not exceed the training length, the model gains translation invariance, which yields a sequence of approximately independent and identically distributed features; the final Full Attention layer then takes a weighted average of these i.i.d. features, and from a statistical point of view, the average of i.i.d. variables extrapolates stably.
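
The statistical intuition can be illustrated with a deliberately simplified simulation (purely illustrative: the features are drawn i.i.d. around a fixed mean and the attention logits are content-independent, which is of course not literally true of a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = rng.normal(size=64)                       # "population mean" of the window features

def attention_average(n):
    """Softmax-weighted average of n i.i.d. feature vectors, mimicking
    what the final Full Attention layer aggregates."""
    feats = mean + rng.normal(size=(n, 64))      # i.i.d. features from the window layers
    logits = rng.normal(size=n)                  # stand-in attention logits
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ feats

for n in (512, 4096):
    err = np.linalg.norm(attention_average(n) - mean)
    print(n, round(float(err), 3))               # stays small at 8x the length: no blow-up
```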

In addition, the author has also been attempting to compare HWFA with other works under standard multi-head attention; further results will be shared as they become available.

Further Thoughts

From the author’s experimental results, it can be seen that the combination of HWFA is slightly worse than the Baseline in terms of training performance. So a very natural concern is whether this difference will further widen as the model scale increases? Or, if the number of parameters increases to tens of billions or even hundreds of billions, will such a design have emergent capabilities like the standard design? This is indeed a concern for many people regarding various architectural modifications in the LLM era, namely the Scaling Law problem. Admittedly, before scaling HWFA up to the billion-parameter scale, this question has no definite answer, but a preliminary guess is that there might be a performance bottleneck.

Of course, HWFA can currently only be regarded as a Baseline for length extrapolation. Its main purpose is to achieve length extrapolation while retaining global dependency, and the initial results suggest it has that potential. The next step is to bring HWFA's training performance up to that of the Baseline while keeping these capabilities. In addition, HWFA can only capture global dependencies in the final Full Attention layer, which presumably creates a performance bottleneck; yet using more Full Attention layers reduces length extrapolation ability, which is another problem that needs to be optimized.

It is worth mentioning that since the Window Attention in the first $L-1$ layers has only a limited receptive field, it is theoretically possible to replace them with models like CNNs, as long as the total receptive field does not exceed the training length $N$. Therefore, exploring the combination of HWFA’s ideas with other basic architectures is also a direction worth considering.

Summary

This article introduced a length extrapolation scheme conceived by the author: by combining Window Attention with Full Attention, it achieves length extrapolation while retaining global dependency. It may well be the only length extrapolation method so far that can be used in generative models while possessing global dependency capabilities.

@online{kexuefm-9603,
        title={The Road to Upgrading Transformer: 9. A New Approach to Global Length Extrapolation},
        author={苏剑林},
        year={2023},
        month={05},
        url={\url{https://kexue.fm/archives/9603}},
}