Looking back, I realize that ever since the 7th article, 《Transformer’s Journey: 7. Length Extrapolation and Local Attention》, the “Transformer’s Journey” series has been grappling with length extrapolation: a run of 9 articles (not counting this one) has revolved around it. It has now been just over a year since that 7th article, and during this year the open-source community has made significant progress on length extrapolation. I have also gradually gained some insights of my own, for example that the problem is far less simple than I initially imagined, and that many earlier works based on local attention are not always effective, which suggests that many of the old analyses did not touch the core of the problem.
In this article, I will try to combine my findings and understanding to “review” the mainstream length extrapolation results and attempt to discover the key aspects of training-free length extrapolation.
Problem Definition#
As the name suggests, training-free length extrapolation means obtaining a model capable of processing and predicting long sequences while training only on short-sequence corpora, without any additional training on long-sequence data, i.e., “Train Short, Test Long”. So how do we determine whether a model can handle long sequences? The most basic criterion is that the model’s Loss or PPL on long sequences does not explode. A more practical evaluation is to feed in a sufficiently long Context, let the model predict an answer, and compare it with the ground truth using metrics such as BLEU and ROUGE; LongBench is one such benchmark.
However, it is important to note that length extrapolation should not come at the cost of sacrificing long-range dependencies – otherwise, there would be no point in considering length extrapolation; it would be better to just truncate the text – meaning that schemes that explicitly truncate long-range dependencies need to be chosen carefully. Examples include ALIBI and most of the schemes listed in 《Transformer’s Journey: 7. Length Extrapolation and Local Attention》, as well as Linear RNN with explicit Decay. These schemes behave like local attention when the sequence length is sufficiently large, and even if they might achieve length extrapolation, there is a risk of insufficient long-range dependencies, requiring careful consideration based on your specific scenario.
How to judge whether long-range dependencies are lost while performing length extrapolation? A more rigorous approach is the evaluation scheme proposed at the end of 《Transformer’s Journey: 12. Infinite Extrapolation with ReRoPE?》, which prepares sufficiently long text but only calculates metrics for the last segment of each sample for each model, as shown in the figure below:
For example, if the model training length is 4K and we want to see the effect of extrapolating to 16K, we prepare a 16K tokens test set. For the 4K model, input the last 4K tokens of each sample to calculate metrics. For the 8K model, input the last 8K tokens of each sample but only calculate metrics for the last 4K tokens. For the 12K model, input the last 12K tokens of each sample but only calculate metrics for the last 4K tokens, and so on. This way, models of different lengths calculate metrics for the same segment of tokens, with the only difference being the input Context. If long-range dependencies are effectively preserved, the metrics should improve as the Context length increases.
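To make this protocol concrete, here is a minimal sketch, assuming a HuggingFace-style causal LM; `model` and `sample_ids` are placeholder names, and the 4K/16K lengths are just the example above:

```python
import torch
import torch.nn.functional as F

def loss_on_last_segment(model, input_ids, segment_len=4096):
    """Feed the full context, but compute the loss only on the last `segment_len` tokens."""
    with torch.no_grad():
        logits = model(input_ids).logits            # [1, seq_len, vocab_size]
    # shift: position t predicts token t+1
    logits, labels = logits[:, :-1], input_ids[:, 1:]
    logits, labels = logits[:, -segment_len:], labels[:, -segment_len:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

# Same target tokens, different amounts of context:
# for ctx in (4096, 8192, 12288, 16384):
#     print(ctx, loss_on_last_segment(model, sample_ids[:, -ctx:]).item())
```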
Rotary Position Encoding#
Having discussed evaluation, let’s return to the methods. At the beginning of the article, we mentioned “old analysis works”. A key distinction between “new” and “old” works here is that “old” works mostly attempted to design new architectures or position encodings to achieve length extrapolation, while “new” works in the past year have mainly focused on the length extrapolation of Decoder-Only Transformer models with Rotary Position Encoding (RoPE).
As an aside, why do most current LLMs choose RoPE as their position encoding? I believe there are several reasons:
- RoPE does not have explicit long-range decay, which is crucial for models aiming for Long Context.
- RoPE is a true position encoding that effectively distinguishes between long-range and short-range information through different frequencies of trigonometric functions, achieving a hierarchical position encoding effect. This is also a key aspect in Long Context.
- RoPE acts directly on Q and K, without changing the form of Attention, which is more compatible with Flash Attention and easier to Scale Up.
In contrast, methods like ALIBI, KERPLE, etc., although sometimes called position encoding, are actually just a type of Attention Bias. They don’t carry much positional information and are not applicable to Encoders. Their usability in Decoders is largely due to the Decoder’s inherent lower-triangular Mask already providing sufficient positional Bias, making additional Attention Bias merely icing on the cake. Furthermore, they cannot effectively distinguish between long-range and short-range within a single head but rely on setting different Decay factors in different heads. This means they would perform poorly when used with single-head attention (e.g., GAU).
Talking so much about the pros and cons might look like “the seller praising his own wares”, but it isn’t; I only want to exchange views, since some readers have raised exactly these questions before. Even as the proposer of RoPE, my understanding of it is not necessarily deeper than anyone else’s. After all, RoPE was originally proposed purely for fun: at the time I simply thought it would be nice if it worked, and matching the performance of learnable absolute position encoding would already have been excellent news. So, given how “unexpected” its success was, it is only “reasonable” that the author himself did not fully understand it at first.
Window Truncation#
I seem to have drifted off-topic again. Simply put, the main points I wanted to convey in the previous two sections are: currently, RoPE seems sufficient for Long Context, so researching RoPE’s length extrapolation is valuable, and when choosing a length extrapolation scheme, we should not sacrifice the ability for long-range dependencies.
In the earliest article on this site discussing length extrapolation, 《Transformer’s Journey: 7. Length Extrapolation and Local Attention》, we concluded that length extrapolation is an OOD (Out Of Distribution) problem during the prediction phase. Although some of the comments in that article appear a bit outdated from today’s perspective, this fundamental judgment remains correct. Applied to RoPE, this means relative distances not seen during training appear during inference. For this, a seemingly feasible approach is to introduce a Sliding Window Attention Mask, as shown on the left below:
Of course, by forcibly truncating attention outside the window, this scheme does not satisfy the principle of “not sacrificing the ability for long-range dependencies”. However, we can treat it as a Baseline. Unfortunately, despite this sacrifice, this scheme does not work – it fails to even achieve the most basic requirement of PPL not exploding! A deep analysis of this phenomenon led to two papers, 《LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models》 and 《Efficient Streaming Language Models with Attention Sinks》, which provided almost the same answer. In fact, several months earlier, an “outsider” had discovered the same conclusion and published it in the Zhihu column article 《Perpetual Sampling Technical Report》.
The answer might be surprising: the first few Tokens are very important and cannot be discarded. Therefore, the usable Window Mask should be as shown on the right above (the LM-Infinite paper calls it the “$\Lambda$-Mask”).
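A small sketch of such a Λ-shaped mask (a handful of always-visible initial tokens plus a sliding window); the `window` and `n_sink` values are illustrative, not prescriptions from the papers:

```python
import torch

def lambda_mask(seq_len, window=2048, n_sink=4):
    """Boolean attention mask: True means query i may attend to key j.
    Combines the causal constraint, a sliding window of width `window`,
    and the first `n_sink` tokens kept visible to every query."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    in_window = (i - j) < window
    is_sink = j < n_sink
    return causal & (in_window | is_sink)
```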
Why do the initial Tokens hold such significant importance? Currently, there are two different perspectives:
- The initial few Tokens are “anchor points” for absolute position: As the name suggests, relative position encoding can in principle only identify relative positions. However, some tasks may rely heavily on absolute position. By using the first few Tokens, whose absolute positions are approximately 0, as “references”, each Token can determine its own absolute position. Removing the first few Tokens eliminates this link, completely disrupting the attention pattern and leading to PPL explosion.
- The initial few Tokens are the attention “recycling bin”: Since attention sums to 1, attention must be allocated to certain Tokens. However, in some cases, the model may find that “no Token is worth paying attention to”. At this point, it chooses to allocate a portion of attention to the first few Tokens, which have little information, serving the purpose of “not paying attention”. Removing them forces the model to allocate attention to other irrelevant Tokens, thus disrupting the attention pattern.
In essence, empirical results show that in most cases, the attention proportion of the first few Tokens is still quite high, so they cannot be removed. Removing them messes up the attention entirely. As for why they are very important, that’s open to everyone’s imagination.
Positional Interpolation#
While window truncation can serve as a decent baseline for length extrapolation, and the “anchor point” or “recycling bin” results provide further insight into how attention mechanisms work, as mentioned earlier, this is achieved by forcibly truncating attention outside the window and sacrificing long-range dependencies. Therefore, it is not the final solution.
The OOD of relative positions directly manifests as relative positions in the prediction phase exceeding the range seen during training. Since they haven’t been trained on, the behavior of the “out-of-bounds” part is unpredictable. To address this, a netizen named “kaiokendev” proposed a very simple solution in his blog 《https://kaiokendev.github.io/til#extending-context-to-8k》 – “positional interpolation” – scaling the position encoding of the long text during prediction by a factor of $\frac{L_{train}}{L_{test}}$, bringing it within the training length range, as shown in the following equation (positions in the equation are relative positions). Soon after, Meta also published the same method in the paper 《Extending Context Window of Large Language Models via Positional Interpolation》, naming it “Positional Interpolation (PI)”, and supplemented it with more sufficient experimental results.
$$ \begin{aligned} &\text{Training Phase}:\,(1,2,\cdots,n-1,n)\\[5pt] &\text{Prediction Phase}:\,(1,2,\cdots,n,\underbrace{n+1,\cdots,4n-1,4n}_{\text{Distant Out-of-Bounds}})\xrightarrow{\quad\text{Interpolation}\quad} \big(\underbrace{\frac{1}{4},\frac{2}{4},\frac{3}{4}}_{\text{Local Distortion}},\cdots,n-\frac{1}{4},n\big)\end{aligned} $$

However, positional interpolation is not considered a length extrapolation scheme, at least not a training-free one, because positional interpolation also leads to PPL explosion. The reason is not hard to understand: although positional interpolation avoids the problem of distant positional OOD, it simultaneously compresses the distance between adjacent Tokens, severely disrupting the model’s local resolution. As is well known, language modeling is a task highly dependent on local relationships, so disrupting the local structure naturally makes prediction inaccurate.
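In code, positional interpolation amounts to nothing more than multiplying the positions fed into RoPE by $\frac{L_{train}}{L_{test}}$. A minimal sketch follows; the function name and the 4K→16K numbers are only illustrative:

```python
import torch

def rope_angles(position_ids, dim=128, base=10000.0, scale=1.0):
    """Angles (m * theta_i) used by RoPE. Positional Interpolation simply sets
    scale = L_train / L_test (< 1), squeezing all positions back into the trained range."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)       # theta_i
    return (position_ids.float() * scale).unsqueeze(-1) * inv_freq    # [seq_len, dim/2]

# Train length 4K, test length 16K -> scale = 4096 / 16384 = 0.25
angles = rope_angles(torch.arange(16384), scale=4096 / 16384)
```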
Nevertheless, this does not mean positional interpolation is without value. We know that readers interested in length extrapolation fall into two categories: one is those without resources for long text fine-tuning, who hope to directly obtain a usable long text model from a short text model. This need has a high requirement for the effect of length extrapolation, making positional interpolation unsuitable for them. The other category is those with resources for long text fine-tuning, who research length extrapolation purely to obtain a better initialization model. This situation has a higher tolerance for initial losses caused by model modifications, as long as the lost performance can be quickly recouped through fine-tuning. Positional interpolation fits this category perfectly. Meta’s paper shows that after PI, a functional long text model can be obtained with only about 1000 steps of long text training, which is much more efficient than training directly without any modifications.
Preserve Nearby, Compress Far#
The problem with direct extrapolation is distant out-of-bounds, while the problem with positional interpolation is local distortion. It seems they are complementary. Can we combine their strengths? This is what was proposed in 《Transformer’s Journey: 12. Infinite Extrapolation with ReRoPE?》, namely Leaky ReRoPE, and its extreme version ReRoPE.
Based on the analysis in the previous section, it’s not difficult to infer that the key to achieving training-free length extrapolation is “preserve nearby, compress far”. This means “ensure local non-distortion” and “compress distant non-out-of-bounds”. Leaky ReRoPE achieves this through a very direct idea: it first sets a window size $w$ and divides relative positions into two parts. Within the window, relative positions are unchanged to achieve “local non-distortion”. Outside the window, positional interpolation is used to achieve “distant non-out-of-bounds”, as shown in the equation below:
$$ \begin{pmatrix} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 & \\ \ddots & 2 & 1 & 0 & \\ w - 1 & \ddots & 2 & 1 & 0 & \\ w & w - 1 & \ddots & 2 & 1 & 0 & \\ w + \frac{1}{k} & w & \ddots & \ddots & 2 & 1 & 0 & \\ w + \frac{2}{k} & w + \frac{1}{k} & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & w + \frac{2}{k} & \ddots & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \\ \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & w - 1 & \ddots & 2 & 1 & 0 & \\ w + \frac{L-1-w}{k} & \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & w - 1 & \ddots & 2 & 1 & 0 & \\ \end{pmatrix} $$

If the interpolation factor $k$ is taken to infinity, we get the minimalist ReRoPE, where position encodings outside the window all become $w$. This means that for sequences of any length, there will be no out-of-bounds positions, theoretically giving it the potential for infinite extrapolation! In fact, Leaky ReRoPE and ReRoPE perform very well. Looking at the Loss, they achieve almost no loss of performance within the training length, and they successfully achieve length extrapolation. Furthermore, the longer the Context, the lower the Loss, indicating that they indeed preserve long-range dependencies while extrapolating.
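Purely to make the piecewise definition above concrete (not as an efficient implementation, for reasons discussed next), here is a small sketch that materializes the Leaky ReRoPE relative position matrix; `w` and `k` are illustrative values:

```python
import torch

def leaky_rerope_positions(seq_len, w=512, k=16.0):
    """Relative positions of Leaky ReRoPE: unchanged within the window w,
    interpolated with factor k beyond it; k -> infinity gives ReRoPE (clamp at w)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    rel = (i - j).clamp(min=0).float()                 # causal relative distances
    return torch.where(rel > w, w + (rel - w) / k, rel)

# ReRoPE corresponds to the k -> infinity limit, i.e. rel.clamp(max=w).
```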
The main issue with Leaky ReRoPE and ReRoPE is that their code implementation is slightly more complicated. Unlike Attention Bias-based position encodings, RoPE cannot be implemented by first constructing a relative position matrix and then calculating relative position encoding (that would be too inefficient). It can only implement relative position encoding through absolute position encoding, which means it can only implement linearly increasing relative positions. Leaky ReRoPE and ReRoPE have piecewise linear relative positions. This means a naive implementation would require calculating the Attention matrix twice (to get two different linear segments) and then concatenating them, which undoubtedly significantly reduces efficiency.
However, the good news is that current mainstream Attention acceleration methods like Flash Attention compute Attention in blocks, for example, 128 length per block. This way, for sufficiently long sequences, the proportion of piecewise linear blocks is very small (only near the window boundary), as shown in the equation below. Only the red and green mixed blocks require repeated Attention calculations; the remaining same-colored blocks only need to be calculated once. Therefore, combined with block-wise Attention calculation, the additional computational cost of Leaky ReRoPE and ReRoPE is almost negligible. Previously, reader @chu-tianxiang also shared a Triton-based implementation in the comments section that readers interested can refer to.
$$ \left(\begin{array}{cccc:cccc:cccc} 0 & \\ 1 & 0 & \\ 2 & 1 & 0 & \\ \ddots & 2 & 1 & 0 & \\ \hdashline w - 1 & \ddots & 2 & 1 & 0 & \\ w & w - 1 & \ddots & 2 & 1 & 0 & \\ w + \frac{1}{k} & w & \ddots & \ddots & 2 & 1 & 0 & \\ w + \frac{2}{k} & w + \frac{1}{k} & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \hdashline \ddots & w + \frac{2}{k} & \ddots & \ddots & \ddots & \ddots & 2 & 1 & 0 & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \\ \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & w - 1 & \ddots & 2 & 1 & 0 & \\ w + \frac{L-1-w}{k} & \ddots & \ddots & \ddots & w + \frac{2}{k} & w + \frac{1}{k} & w & w - 1 & \ddots & 2 & 1 & 0 & \\ \end{array}\right) $$

Coincidentally, a paper submitted to Arxiv earlier this month, 《LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning》, proposed a training-free length extrapolation method called “Self-Extend”. It is essentially Leaky ReRoPE with a Round operation added, making each relative position an integer again, further alleviating the relative position OOD problem. The paper reports excellent results, further confirming the effectiveness of Leaky ReRoPE.
Rotating Perspective#
Although Leaky ReRoPE and ReRoPE perform quite well in practice (at least in terms of Loss), like positional interpolation, they directly operate on position numbers (Position Ids). This feels like “treating the head when the head aches, and treating the foot when the foot hurts”, lacking a deep analysis of the underlying principles. Because for the model, position numbers are not important; position embeddings are what interact directly with the model. Therefore, to get closer to the “root cause”, we should try to start from position embeddings.
Some readers might ask: Aren’t position numbers and position embeddings in one-to-one correspondence? Isn’t operating on position numbers equivalent to operating on position embeddings? That’s true, but their actual behavior is different. For example, position numbers are unbounded, but position embeddings are bounded (RoPE is composed of trigonometric functions, which are bounded). What interacts directly with the model is the position embedding. If the position number is OOD, the position embedding is not necessarily OOD. Therefore, analyzing from the perspective of position embeddings can provide a clearer understanding of what the OOD caused by length extrapolation specifically looks like, and thus allows for more targeted treatment.
In 《Transformer’s Journey: 2. Rotary Position Embedding: A Blend of Strengths》, when we derived RoPE, we first used complex numbers to derive a 2D solution, and then concatenated multiple 2D solutions to form a high-dimensional solution. This way, the inner product of $\boldsymbol{q},\boldsymbol{k}$ after adding RoPE can be expressed using complex numbers as:
$$ (\boldsymbol{\mathcal{R}}_m \boldsymbol{q})^{\top}(\boldsymbol{\mathcal{R}}_n \boldsymbol{k}) = \text{Re}\left[\sum_{i=0}^{d/2-1}\boldsymbol{q}_{[2i:2i+1]}\boldsymbol{k}_{[2i:2i+1]}^* e^{\text{i}(m-n)\theta_i}\right] $$

Here $\theta_i$ defaults to $10000^{-2i/d}$, a function of $i$ that decreases gradually from 1 down to nearly 0. From Euler’s formula $e^{\text{i}t}=\cos t + \text{i}\sin t$, we know that $e^{\text{i}(m-n)\theta_i}$ is a point on the unit circle. As $m-n$ increases, this point rotates around the unit circle (a rotation in the literal sense of “rotary”). A larger $\theta_i$ means faster rotation and a shorter period, and vice versa.
Assuming the training length is $L_{train}$, then $m-n\in[0, L_{train}-1]$. Now let’s use our imagination fully: a larger $\theta_i$ means a faster rotation speed and a shorter period. Thus, during $m-n$ from $0$ to $L_{train}-1$, it has rotated many times, meaning almost every point on the circle has been trained. Therefore, these $\theta_i$ almost have no OOD problem. Conversely, for a smaller $\theta_i$, it may not have completed one rotation when $m-n$ goes from $0$ to $L_{train}-1$. In this case, the trained points are at most an arc on the circle. If a larger $L_{test}$ is encountered during testing, it exceeds the range of the trained arc, leading to unpredictable behavior. At this point, interpolation is needed to compress it back into the original arc. In short, whether the position index $m-n$ is OOD is not important at all; what matters is whether the points on the unit circle have been fully trained. If they have, no modification is needed (direct extrapolation). Otherwise, one must find a way to compress it into the arc that has been fully trained (positional interpolation).
Specifically, for $\theta_i$, we can calculate the period as $T_i=2\pi/\theta_i$. Then we can calculate the “number of rotations” during training as $r_i=\frac{L_{train}}{T_i}=\frac{\theta_i L_{train}}{2\pi}$. We can set a threshold for the number of rotations, $\tau$. If the number of rotations exceeds $\tau$, it is considered fully trained and can be used without modification. If the number of rotations is less than 1, $\theta_i$ is changed to $\frac{\theta_i L_{train}}{L_{test}}$, meaning that the part exceeding the arc range needs to be scaled back into the arc. As for the remaining part, linear interpolation is used to transition between the two. This can be expressed by the formula:
$$ \theta_i^{new} = \left[\gamma_i + (1 - \gamma_i)\frac{L_{train}}{L_{test}}\right]\theta_i,\quad \gamma_i = \left\{\begin{aligned}&1,&r_i > \tau \\ &0,&r_i < 1 \\ &\frac{r_i - 1}{\tau - 1},&\text{others} \end{aligned}\right. $$

This is the training-free length extrapolation scheme “YaRN” proposed in the paper 《YaRN: Efficient Context Window Extension of Large Language Models》. In my tests, its extrapolation effect is very good, only slightly inferior to Leaky ReRoPE and ReRoPE. However, it should be noted that YaRN only changes the value of $\theta_i$ and does not change the form of Attention or RoPE. Therefore, there is no additional implementation cost or inference cost. Under this condition (i.e., it can be fully integrated into existing implementations), YaRN is the best length extrapolation method I have tested.
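A minimal sketch of this $\theta_i$ correction; the threshold `tau=32` and the lengths are only illustrative values, and YaRN as published additionally multiplies the attention logits by a scale factor, discussed in the next section:

```python
import torch

def yarn_thetas(dim=128, base=10000.0, L_train=4096, L_test=16384, tau=32.0):
    """theta_i correction as in the formula above: frequencies with more than tau
    rotations over the training length are kept, those with less than one rotation
    are fully interpolated, and the rest are linearly blended."""
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)
    r = L_train * theta / (2 * torch.pi)              # rotations seen during training
    gamma = ((r - 1) / (tau - 1)).clamp(0.0, 1.0)     # 1 if r > tau, 0 if r < 1
    return (gamma + (1 - gamma) * L_train / L_test) * theta
```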
Some Interludes#
Actually, the story of YaRN is not over yet, but I feel the previous section was already quite long, so it’s better to start a new section. Besides introducing the change in $\theta_i$, YaRN also multiplied the Attention Logits by an additional Scale factor:
$$ \lambda = \left(1 + 0.1 \log \frac{L_{test}}{L_{train}}\right)^2 \approx 1 + 0.2 \log \frac{L_{test}}{L_{train}}\label{eq:scale-yarn} $$

The “derivation” of this Scale is somewhat amusing: there was none. The author said he could not derive it theoretically either; it was purely an experimental finding that multiplying the attention logits by this Scale lowered the PPL, and the form above was itself fitted from experiments.
In fact, this result containing a logarithm is clearly very similar to the $\log n$ Scale derived in 《Understanding Attention’s Scale Operation from Entropy Invariance》. The only difference is that the latter is related to the specific position, while the former is a constant after $L_{test}$ is determined. Considering that when $n$ is relatively large, the $\log n$ function changes slowly, it is acceptable to treat it as a constant within a certain range. Therefore, it is not difficult to guess that YaRN’s Scale factor and the entropy invariance $\log n$ Scale should have the same origin. I have also made comparisons and found that replacing the constant $\lambda$ with the following factor related to the absolute position $n$ can achieve a similar effect:
$$ \lambda_n = \max\left(1, \frac{\log n}{\log L_{train}}\right)\label{eq:clip-logn} $$Note that
$$ \frac{\log L_{test}}{\log L_{train}} = 1 + \frac{1}{\log L_{train}} \log\left(\frac{L_{test}}{L_{train}}\right) $$

YaRN conducted experiments based on LLAMA and LLAMA2. The training length for the former is 2K, and for the latter is 4K. We have $1/\log 2048 \approx 0.13$ and $1/\log 4096 \approx 0.12$. The coefficient is roughly half of that in equation $\text{eq:scale-yarn}$, which is not a big difference. In fact, the exact value of this coefficient may not be very important, as I have also found datasets where equation $\text{eq:clip-logn}$ performs better. Thus, we have approximately derived equation $\text{eq:scale-yarn}$.
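A quick numerical check of the two factors at, say, $L_{train}=4096$ and $L_{test}=16384$ (values chosen only for illustration):

```python
import math

L_train, L_test = 4096, 16384
yarn_scale = (1 + 0.1 * math.log(L_test / L_train)) ** 2             # eq. (scale-yarn)
logn_scale = max(1.0, math.log(L_test) / math.log(L_train))          # eq. (clip-logn) at n = L_test
print(round(yarn_scale, 2), round(logn_scale, 2))   # ~1.30 vs ~1.17: the same order of magnitude
```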
Compared to YaRN itself, the story of YaRN’s author, Bowen Peng, is perhaps even more “fascinating”. His earlier proposed NTK-RoPE was the first training-free length extrapolation scheme for RoPE. Two blog posts in this series, 《Transformer’s Journey: 10. RoPE is a β-ary Encoding》 and 《Transformer’s Journey: 11. Carrying the β-ary Position to the End》, were directly inspired by it. Although from the current perspective, the effect of NTK-RoPE may not be particularly good (compared to YaRN, ReRoPE, etc.), it was the first to demonstrate the possibility of training-free length extrapolation and is of landmark significance. One could even say that all subsequent research related to length extrapolation directly or indirectly benefited from NTK-RoPE opening up everyone’s imagination.
NTK-RoPE’s idea is very simple: just change the base of RoPE. The original was $\theta_i = 10000^{-2i/d}$, and now it is changed to $\theta_i = (10000\kappa)^{-2i/d}$. How is $\kappa$ chosen? At that time, Bowen Peng, based on his experience with NTK (Neural Tangent Kernel) related results, judged that high frequencies ($i\to 0$) learn relative distance and therefore should not be changed, while low frequencies ($i\to d/2-1$) learn absolute distance and therefore need interpolation. In summary, it is “high frequency extrapolation, low frequency interpolation”. So he set the Scale for $i = d/2-1$ to be exactly equal to the interpolation Scale $\frac{L_{train}}{L_{test}}$, yielding the equation:
$$ (10000\kappa)^{-2i/d}|_{i=d/2-1} = \left.\frac{L_{train}}{L_{test}}10000^{-2i/d}\right|_{i=d/2-1} $$Solving for $\kappa$ gives:
$$ \kappa = \left(\frac{L_{test}}{L_{train}}\right)^{d/(d-2)}\label{eq:kappa} $$

This simple yet brilliant derivation opened the “Pandora’s Box” of training-free length extrapolation.
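A minimal sketch of the resulting base adjustment; the head dimension and lengths are illustrative:

```python
def ntk_rope_base(base=10000.0, dim=128, L_train=4096, L_test=16384):
    """NTK-RoPE / RoPE-ABF: scale the base by kappa = (L_test/L_train)^(d/(d-2)),
    leaving the highest frequency untouched while fully interpolating the lowest."""
    kappa = (L_test / L_train) ** (dim / (dim - 2))
    return base * kappa

# e.g. dim=128, 4K -> 16K: kappa is about 4.09, so the new base is about 40900.
```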
From YaRN’s perspective, it’s not only the $\theta_i$ at $i = d/2-1$ that rotates less than one cycle. Therefore, NTK-RoPE only performing full interpolation for the last $i = d/2-1$ is not sufficient. In fact, this is indeed the case. Setting $\kappa$ as in equation $\text{eq:kappa}$ can only allow the model to extrapolate to a length around $L_{test}/2$ without PPL explosion; beyond that, PPL increases significantly. It is because of this problem that the author further proposed the subsequent upgraded scheme YaRN.
However, despite NTK-RoPE’s performance being inferior to YaRN, readers in the second category mentioned earlier (those with resources for long text fine-tuning) might prefer NTK-RoPE. Since they only want a better initialization model and intend to fine-tune anyway, they won’t care too much about the initial performance difference between NTK-RoPE and YaRN. Compared to that, they are more willing to choose the simpler-to-implement NTK-RoPE. For example, CodeLLAMA is based on LLAMA2, with the base changed to $10^6$ and then further trained. Additionally, Meta, in their paper 《Effective Long-Context Scaling of Foundation Models》, renamed NTK-RoPE to RoPE-ABF (Adjusted Base Frequency). Compared to the mysterious NTK, the name ABF more intuitively reflects its meaning.
Refusing to Pay Tax#
I wonder if everyone has noticed that the training-free length extrapolation methods mentioned above cannot keep the model’s performance within the training length $L_{train}$ unchanged. Specifically, let the original model be $f(x)$, and the modified model with extrapolation changes be $f^+(x)$. When the length of $x$ does not exceed $L_{train}$, it cannot be guaranteed that $f(x)\equiv f^+(x)$. Since $f(x)$ was trained within $L_{train}$, it is reasonable to assume that $f(x)$ has optimal performance for samples with length not exceeding $L_{train}$. Thus, $f^+(x)\neq f(x)$ means that while length extrapolation improves the performance on longer samples, the performance within the original $L_{train}$ range deteriorates. We can figuratively call this part of the loss the “extrapolation tax”.
As early as when NTK-RoPE was just proposed, the open-source community realized the problem of the “extrapolation tax” and proposed a corresponding solution – dynamically adjusting the Scale factor of various extrapolation methods as the training length changes. This is Dynamic Scaling, first proposed in a Reddit post 《Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning》. Taking YaRN as an example, the length-dependent scaling factor is $s=\frac{L_{test}}{L_{train}}$. Dynamic Scaling replaces it with a dynamic $s(pos)=\frac{\max(L_{train}, pos+1)}{L_{train}}$, where $pos$ is the position index of the current Token (starting from zero). This modification means Dynamic Scaling attempts to find the minimum scale factor for each position that theoretically has the least impact on model performance (or equivalently, each position is assigned a different $\theta_i(pos)$), thereby achieving the effect of refusing to pay tax.
However, truly assigning a different $\theta_i(pos)$ to each position is difficult, for the same reason that Leaky ReRoPE and ReRoPE require repeated Attention calculations: since RoPE realizes relative positions through absolute positions, a single pass can only implement one fixed $\theta_i$. To use different $\theta_i$ for different positions, the KV Cache must store K before RoPE is applied, and each position must then be processed separately, turning the computation into a recursive, RNN-like process. We know that an LLM responding to one round of dialogue goes through two stages, prefill and generation: prefill is the computation over the input, and generation produces tokens one by one. The prefill stage is naturally parallelizable; if it too is made recursive like generation, then with very long inputs (e.g., feeding in a whole paper) the computation slows down dramatically, which makes the approach impractical.
Therefore, a compromise is to be “locally static”: count the number of Tokens in the input at the prefill stage, set a max_gen_tokens budget for the generation stage, and add the two to obtain the $L_{test}$ (and hence the $\theta_i$) used for the current round of dialogue. Once the round finishes, $L_{test}$ and $\theta_i$ are updated in the same way for the next round. This requires neither an overly complex implementation nor any sacrifice of efficiency, making it a fairly practical solution; in particular, when the input is very long and max_gen_tokens is far smaller than the number of prefill tokens, the scale within a single round is already approximately constant anyway.
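A small sketch of both variants, the fully dynamic per-position scale and the per-round “locally static” compromise; the function names are mine, not from the original post:

```python
def dynamic_scale(pos, L_train=4096):
    """Per-position Dynamic Scaling: no scaling until the training length is exceeded."""
    return max(L_train, pos + 1) / L_train

def per_round_scale(prefill_tokens, max_gen_tokens, L_train=4096):
    """'Locally static' compromise: one scale per dialogue round, computed from
    L_test = prefill_tokens + max_gen_tokens, so the prefill stays fully parallel."""
    return max(L_train, prefill_tokens + max_gen_tokens) / L_train
```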
The idea of Dynamic Scaling can be said to have been maximized by CLEX, proposed in 《CLEX: Continuous Length Extrapolation for Large Language Models》: CLEX also needs to assign a different $\theta_i(pos)$ to each position. It assumes $\theta_i(pos)$ is a continuous function of $pos$ and models it using a neural ODE. By fine-tuning, it fits the parameters of this ODE, ultimately achieving better results than YaRN. Furthermore, experimental results show that continuously applying Dynamic Scaling can yield almost infinite length extrapolation capability.
Starting Anew#
Besides Dynamic Scaling, another approach to “refuse to pay tax” is to “start anew” by redesigning the model architecture used during pre-training so that it has the potential to achieve length extrapolation without any modifications after training is completed. In this series of articles, I have two related discussions, namely HWFA (Hybrid Window-Full Attention) mentioned in 《Transformer’s Journey: 9. A New Approach to Global Length Extrapolation》, and Key Norm, which was verified in 《Transformer’s Journey: 15. Key Normalization Aids Length Extrapolation》.
In HWFA, the first $L-1$ layers of Attention are replaced with RoPE + Window Attention with a small window, while the last layer of Attention is replaced with NoPE + Full Attention. Models trained with this modification have a certain degree of length extrapolation capability without further changes. A similar idea is contained in 《Focused Transformer: Contrastive Training for Context Scaling》, although this paper is not about length extrapolation but rather expanding the Context length of LLM through simple fine-tuning. The problem with HWFA is that its training performance is inferior to the standard Attention model. To address this, I later proposed an improved version HWFA2 (i.e., HWFA + ReRoPE) in 《Transformer’s Journey: 14. When HWFA Meets ReRoPE》.
Compared to HWFA, HWFA2 uses a larger Window Size for Window Attention and restores RoPE for Full Attention. It also allows more than one layer of Full Attention interspersed among Window Attention layers (not just one layer at the end). These modifications can close the gap in training performance with standard Attention (and occasionally even surpass it), but the drawback is that it can no longer achieve length extrapolation without modification (RoPE needs to be replaced with ReRoPE). So there are gains and losses. Of course, we can also ignore the extrapolation effect and simply view HWFA2 as an acceleration method that does not lose performance and significantly reduces model complexity. By the way, a paper on Arxiv last month, 《Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention》, proposed a method called Zebra, which also uses a combination of several Full Attention layers interspersed with Window Attention layers, similar to HWFA2.
As for Key Norm, it originated from the “unexpected discovery” that normalizing the Attention Key with L2 significantly improved the model’s length extrapolation ability. Further thinking about this deepened my understanding of length extrapolation. For standard Attention based on Q, K inner product, we can express it as:
$$ s(n|m) = \boldsymbol{q}_m\cdot \boldsymbol{k}_n = \Vert\boldsymbol{q}_m\Vert \Vert\boldsymbol{k}_n\Vert \cos(\boldsymbol{q}_m,\boldsymbol{k}_n),\quad p(n|m) = \frac{\exp\left(\frac{s(n|m)}{\sqrt{d}}\right)}{\sum\limits_{j=1}^m \exp\left(\frac{s(j|m)}{\sqrt{d}}\right)} $$

Clearly, to increase the relative attention that query $m$ pays to some position $n$, the model has two options: increase $\Vert\boldsymbol{k}_n\Vert$, or increase $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$. Due to the curse of dimensionality, increasing $\Vert\boldsymbol{k}_n\Vert$ is easier than increasing $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$, so if possible the model will prefer to increase $\Vert\boldsymbol{k}_n\Vert$. Note that $\Vert\boldsymbol{k}_n\Vert$ is independent of the query position $m$ and thus describes an absolute importance of position $n$; this may be one of the reasons for the attention distribution characteristics described by Scissorhands. On the other hand, the model’s tendency to increase $\Vert\boldsymbol{k}_n\Vert$ means that the training of $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$ may be insufficient, and this is probably the more fundamental reason why Attention cannot achieve length extrapolation.
Thus, the reason why Key Norm can improve length extrapolation ability becomes clear. Key Norm normalizes all $\Vert\boldsymbol{k}_n\Vert$ to 1, so the model no longer has the option to “increase $\Vert\boldsymbol{k}_n\Vert$”. It can only focus on adjusting $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$, making the training of $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$ more sufficient. At the same time, I have also conducted comparative experiments and found that Key Norm only exhibits length extrapolation capability when combined with RoPE. Key Norm + NoPE or simply NoPE do not have length extrapolation effects. This is probably due to RoPE’s own rotation effect, which enriches the diversity of the angle between $\boldsymbol{q}_m$ and $\boldsymbol{k}_n$ (like data augmentation), thereby making the training of $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$ more sufficient.
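A minimal sketch of Key Norm inside the attention score computation; the shapes and the $\sqrt{d}$ scaling follow the standard convention, and this is only an illustration rather than the exact implementation from the original article:

```python
import torch
import torch.nn.functional as F

def keynorm_scores(q, k):
    """Attention logits with Key Norm: every key is L2-normalized, so ||k_n|| = 1
    and the model can only shape attention through cos(q_m, k_n)."""
    d = q.shape[-1]
    k = F.normalize(k, dim=-1)                       # ||k_n|| = 1
    return (q @ k.transpose(-1, -2)) / d ** 0.5      # [..., len_q, len_k]
```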
Another interesting paper, 《CoCA: Fusing position embedding with Collinear Constrained Attention for fine-tuning free context window extending》, proposes a solution from a different angle: it modifies the implementation of attention such that for each group $\boldsymbol{q}_m^{(i)},\boldsymbol{k}_m^{(i)}$ we have $\cos(\boldsymbol{q}_m^{(i)},\boldsymbol{k}_m^{(i)})=1$, where group $i$ refers to the pairwise grouping of the RoPE components of $\boldsymbol{q}$ and $\boldsymbol{k}$ mentioned earlier. This design ensures that the relatively large values of $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$ are trained as fully as possible (the maximum of $\cos$ is 1), and only the small values remain under-trained (after Softmax these carry small probabilities and do not significantly interfere with the attention distribution), thereby achieving some degree of length extrapolation. However, CoCA’s modification of attention may risk reducing the capacity of each attention head: with the same number of parameters, it might only have the fitting capacity of a standard attention head with head_size/2.
Other Ideas#
Writing up to this point, the introduction to length extrapolation techniques is coming to an end. Although I have written a considerable length, it is still difficult to provide detailed introductions for all length extrapolation works. Below, I will list some other related works that come to mind.
Initially, we believed that Attention could not extrapolate length because of “out-of-bounds” positions during prediction. A simple solution for this is to perturb the position encoding during the training phase, similar to data augmentation, in an attempt to make the model adapt to the position encoding used during prediction in advance. 《Transformer’s Journey: 8. Length Extrapolation and Position Robustness》 and 《Transformer’s Journey: 13. Inverse Use of Leaky ReRoPE》 in this series fall into this category, as does 《PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training》 from a few months ago. These methods were not very stable in my experiments and introduced additional complexity or randomness, making it difficult to guarantee that they wouldn’t affect the model’s original Scaling Law.
Some readers have raised the question: In the analysis of YaRN, the low-frequency part needs interpolation. What if we just remove the low-frequency part entirely? Or similarly, reduce the base to increase the proportion of high-frequency parts? I have tried reducing the base of RoPE during pre-training, and the result was that the final performance was worse, and it did not show length extrapolation ability. However, 《Scaling Laws of RoPE-based Extrapolation》 (a Chinese version is available on Zhihu 《Scaling Laws of RoPE Extrapolation – Attempting to Extrapolate RoPE to 1M Context》) experimented with another scheme, which is to reduce the base only during the fine-tuning phase. Combined with short text fine-tuning, it can exhibit long text extrapolation capability.
However, from my perspective, reducing the base or even removing low frequencies is not scientifically sound. Even if it might have a length extrapolation effect in certain cases, it might sacrifice the model’s inherent capabilities. As the author of NTK-RoPE and YaRN, Bowen Peng, once pointed out, high frequencies learn local relative distance, and low frequencies learn remote absolute distance. Both are important, and they are more like a hierarchical relationship. From the β-ary (numeral base) perspective of 《Transformer’s Journey: 10. RoPE is a β-ary Encoding》, low frequencies correspond to high digits. If we only keep the low digits and remove the high digits, the result is equivalent to taking the modulo (remainder), which cannot accurately express positional information. Moreover, high and low frequencies are relative; a frequency that is low frequency for 10K length text might be high frequency for 100K length text.
Recently, there is also an interesting paper 《Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use》, which found that averaging the outputs of the same model with different bases can enhance the overall performance of the model. This indicates that different sizes of the base each have their strengths and should not be simply reduced for the sake of extrapolation.
In summary, although length extrapolation technology has made significant progress, it remains rather mysterious. For example, replacing RoPE with ReRoPE at inference time shows a certain length extrapolation effect, so would using ReRoPE during pre-training extrapolate even better? On the contrary: I have run experiments with ReRoPE in the training phase, and the resulting model had no length extrapolation capability at all. This is largely related to the analysis in the Key Norm section: using ReRoPE during training reduces the diversity of the angle between $\boldsymbol{q}_m$ and $\boldsymbol{k}_n$, which actually makes the training of $\cos(\boldsymbol{q}_m,\boldsymbol{k}_n)$ less sufficient and thereby weakens length extrapolation. Many length extrapolation techniques may also be tied to the architecture. Some position encodings that were reported in earlier years to be capable of length extrapolation, including ALIBI, KERPLE, XPOS, etc., were tested on Multi-Head Attention + Pre Norm; on Single-Head GAU + Post Norm, I have never measured any length extrapolation capability from them. This suggests that the analysis of length extrapolation may still be missing an architecture-dependent part.
Summary#
In this article, combining my learning experience, I have reviewed the related progress on length extrapolation over the past year, primarily introducing the characteristics of related methods and the ideas behind them in a concise way, and attempting to connect them. I hope this article can help everyone gain a deeper and more systematic understanding of the topic of length extrapolation.
@online{kexuefm-9948,
title={Transformer's Journey: 16. A "Review" of Length Extrapolation Techniques},
author={苏剑林},
year={2024},
month={01},
url={\url{https://kexue.fm/archives/9948}},
}