
11. Carrying the Base-β Position to the End

This is a gemini-2.5-flash-preview-04-17 translation of a Chinese article. Beware of potential errors.

In the article 《Transformer Upgrade Path: 10. RoPE is a Base-$\beta$ Encoding》, we gave a base-$\beta$ interpretation of RoPE and, building on the idea of base conversion, derived NTK-aware Scaled RoPE, which can extend the Context length without fine-tuning. It must be said that understanding position encoding through the base-$\beta$ analogy is a beautiful and inspiring perspective; every time I delve deeper and reflect on it, I seem to gain new insights.

This article will revisit the base-$\beta$ interpretation of RoPE and attempt to generalize the existing NTK-aware Scaled RoPE in the hope of finding a better strategy for extending LLM’s Context length without fine-tuning.

Base Analogy

We know that RoPE’s parameterization follows the form of Sinusoidal position encoding. Whether by coincidence or design, the Sinusoidal position encoding of an integer $n$ shares many similarities with its base-$\beta$ encoding.

Specifically, the $m$-th digit (counting from right to left) of the base-$\beta$ representation of an integer $n$ is:

$$ \left\lfloor\frac{n}{\beta^{m-1}}\right\rfloor\bmod\beta\tag{1} $$

While its Sinusoidal position encoding is

$$ \boldsymbol{p}_n=\big[\cos\theta_1,\sin\theta_1,\cos\theta_2,\sin\theta_2,\cdots,\cos\theta_{d/2},\sin\theta_{d/2}\big]\\[5pt]\theta_m = \frac{n}{\beta^{m-1}},\quad \beta=10000^{2/d}\tag{2} $$

As can be seen, both have the same $\frac{n}{\beta^{m-1}}$, and $\bmod$ and $\cos,\sin$ are both periodic functions, so the only difference between the two is the insignificant floor function $\lfloor\cdot\rfloor$. Therefore, it is very intuitive and reasonable to analogize RoPE/Sinusoidal position encoding to its base-$\beta$ representation.
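To make the analogy concrete, here is a minimal numerical sketch (not from the original article; the helper names are made up) that reads off the same ratio $\frac{n}{\beta^{m-1}}$ in the two ways described above:

```python
import numpy as np

def base_beta_digits(n, beta, num_digits):
    # m-th digit (from the right, m = 1..num_digits) of n in base beta,
    # i.e. equation (1): floor(n / beta^(m-1)) mod beta.
    return [(n // beta ** (m - 1)) % beta for m in range(1, num_digits + 1)]

def sinusoidal_encoding(n, d, base=10000.0):
    # Equation (2): theta_m = n / beta^(m-1) with beta = base^(2/d),
    # read out through the periodic pair (cos, sin) instead of floor + mod.
    beta = base ** (2.0 / d)
    theta = n / beta ** np.arange(d // 2)          # theta_1, ..., theta_{d/2}
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1).ravel()

print(base_beta_digits(1234, 10, 4))   # [4, 3, 2, 1]
print(sinusoidal_encoding(1234, d=8))  # the same ratios, passed through cos/sin
```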

Correcting NTK

Following the idea from 《Transformer Upgrade Path: 10. RoPE is a Base-$\beta$ Encoding》: direct extrapolation concentrates the extrapolation pressure on the "high digits" (large $m$), while position interpolation makes the representation of the "low digits" (small $m$) denser, which makes relative distances harder to distinguish. NTK-aware Scaled RoPE is essentially a base conversion, which spreads the extrapolation pressure across every digit while keeping adjacent intervals unchanged. These properties are friendly, indeed crucial, for LLMs, which clearly rely more on relative positions, so it achieves reasonable results even without fine-tuning.

Looking closely at equation $\text{(2)}$, each $\cos,\sin$ pair acts as a single digit, so the encoding really has only $d/2$ digits; in other words, it is equivalent to a $d/2$-digit base-$\beta$ representation of $n$. If we want to extend the Context by a factor of $k$ by converting from base $\beta$ to base $\beta\lambda$, then we should at least have

$$ \lambda^{d/2}=k\quad\Rightarrow\quad\lambda = k^{2/d} $$

Thus the new RoPE becomes

$$ \boldsymbol{p}_n=\big[\cos\theta_1,\sin\theta_1,\cos\theta_2,\sin\theta_2,\cdots,\cos\theta_{d/2},\sin\theta_{d/2}\big]\\[5pt]\theta_m = \frac{n}{(\beta\lambda)^{m-1}},\quad \beta=10000^{2/d},\quad \lambda = k^{2/d}\tag{3} $$

This is the NTK-RoPE we proposed in the previous article.
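As a sketch (hypothetical helper name, following the notation of equation $\text{(3)}$), the base conversion amounts to nothing more than replacing $\beta$ with $\beta\lambda$ in the angles, i.e. multiplying the base 10000 by $k$:

```python
import numpy as np

def ntk_rope_old_theta(n, d, k, base=10000.0):
    # Equation (3), "NTK-RoPE-old": theta_m = n / (beta * lam)^(m-1),
    # with beta = base^(2/d) and lam = k^(2/d); equivalently, base -> base * k.
    beta = base ** (2.0 / d)
    lam = k ** (2.0 / d)
    m = np.arange(1, d // 2 + 1)
    return n / (beta * lam) ** (m - 1)
```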

However, after careful consideration, the author found that this is still not quite right. Going back to equation $\text{(1)}$, the $m$-th digit in base $\beta\lambda$ should be

$$ \left\lfloor\frac{n}{(\beta\lambda)^{m-1}}\right\rfloor\bmod(\beta\lambda) $$

This means that in addition to replacing $\frac{n}{\beta^{m-1}}$ with $\frac{n}{(\beta\lambda)^{m-1}}$, the period for $\bmod$ also needs to be expanded by $\lambda$. This is equivalent to dividing by an extra $\lambda$ before calculating $\cos,\sin$:

$$ \boldsymbol{p}_n=\big[\cos\theta_1,\sin\theta_1,\cos\theta_2,\sin\theta_2,\cdots,\cos\theta_{d/2},\sin\theta_{d/2}\big]\\[5pt]\theta_m = \frac{n}{\lambda(\beta\lambda)^{m-1}},\quad \beta=10000^{2/d},\quad \lambda = k^{2/d}\tag{4} $$

In subsequent experiments, we will refer to equation $\text{(3)}$ from the previous article as “NTK-RoPE-old”, and equation $\text{(4)}$ as “NTK-RoPE-fixed”.
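In code terms the fix is a single extra division by $\lambda$ (again a sketch, reusing the hypothetical `ntk_rope_old_theta` from the snippet above):

```python
def ntk_rope_fixed_theta(n, d, k, base=10000.0):
    # Equation (4), "NTK-RoPE-fixed": same as equation (3) but divided by one more
    # lambda, which also widens the "mod period" of every digit from beta to beta*lambda.
    lam = k ** (2.0 / d)
    return ntk_rope_old_theta(n, d, k, base) / lam
```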

Mixed Radix

Now, let's be a bit more "wild": since we can represent positions in base $\beta$, why not use the more general "mixed radix" system, in which each digit has its own base? This is familiar from everyday life: 60 seconds make a minute, 60 minutes make an hour, 24 hours make a day, and 7 days make a week. Here 60, 60, 24, and 7 are the different bases, so seconds, minutes, hours, days, and weeks together form a mixed-radix system.

Assume that counting from right to left, the 1st digit uses base $\beta_1$, the 2nd digit uses base $\beta_2$, the 3rd digit uses base $\beta_3$, …, then the $m$-th digit of $n$ is given by

$$ \left\lfloor\frac{n}{\beta_1\beta_2\cdots\beta_{m-1}}\right\rfloor\bmod\beta_m\tag{5} $$
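As a quick illustration of equation $\text{(5)}$ (a small sketch; the helper below is not from the original article), here is the seconds/minutes/hours/days/weeks example written as digit extraction:

```python
def mixed_radix_digits(n, bases):
    # m-th digit of n in a mixed-radix system, equation (5):
    # floor(n / (beta_1 * ... * beta_{m-1})) mod beta_m, counting from the right.
    digits, scale = [], 1
    for beta_m in bases:
        digits.append((n // scale) % beta_m)
        scale *= beta_m
    return digits

# 1 week, 3 days, 3 hours, 25 minutes, 7 seconds expressed in seconds:
n = 1 * 7 * 86400 + 3 * 86400 + 3 * 3600 + 25 * 60 + 7
print(mixed_radix_digits(n, [60, 60, 24, 7, 10]))  # [7, 25, 3, 3, 1]
```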

Why consider a mixed radix? Because one day the author noticed an interesting fact: RoPE is essentially a relative position encoding, and the relative position matrix is a special case of a Toeplitz matrix. It looks like this (the upper-right part is omitted, since this article mainly focuses on language models):

$$ \begin{pmatrix}0 & \\1 & 0 & \\2 & 1 & 0 &\\3 & 2 & 1 & 0 & \\4 & 3 & 2 & 1 & 0 & \\5 & 4 & 3 & 2 & 1 & 0 & \\6 & 5 & 4 & 3 & 2 & 1 & 0 & \\\end{pmatrix} $$

From the matrix above, we can see that the distribution of relative positions is uneven! The value 0 appears most often, followed by 1, then 2, and so on; that is, larger $n$ appears less frequently. This means that, viewed as a base-$\beta$ encoding, RoPE's "high digits" are likely under-trained, or in other words, the high digits may generalize less well than the low digits. As we just said, NTK-RoPE spreads the extrapolation pressure across every digit. If this conjecture holds, then spreading it evenly is not optimal: the low digits should bear more of the pressure and the high digits less. This is exactly what a mixed radix allows.
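The unevenness is easy to quantify: in an $L\times L$ causal attention matrix, relative distance $n$ appears exactly $L-n$ times. A tiny sketch:

```python
import numpy as np

L = 8
rel = np.arange(L)[:, None] - np.arange(L)[None, :]  # relative position matrix
counts = np.bincount(rel[rel >= 0])                   # lower-triangular part only
print(counts)  # [8 7 6 5 4 3 2 1]: distance n occurs L - n times
```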

Allocation Optimization

Specifically, we extend to $k$ times the Context by converting from base-$\beta$ to a mixed radix of $\beta_1,\beta_2,\cdots,\beta_{d/2}$, where $\beta_m = \beta \lambda_m$. At this point, equation $\text{(5)}$ becomes

$$ \left\lfloor\frac{n}{\beta^{m-1}(\lambda_1\lambda_2\cdots\lambda_{m-1})}\right\rfloor\bmod(\beta\lambda_m) $$

Equation $\text{(4)}$ also changes accordingly to

$$ \boldsymbol{p}_n=\big[\cos\theta_1,\sin\theta_1,\cos\theta_2,\sin\theta_2,\cdots,\cos\theta_{d/2},\sin\theta_{d/2}\big]\\[5pt]\theta_m = \frac{n}{\beta^{m-1}(\lambda_1\lambda_2\cdots\lambda_m)},\quad \beta=10000^{2/d} $$

According to the two principles of "extend by a factor of $k$" and "low digits should bear more", the constraints are

$$ \lambda_1\lambda_2\cdots\lambda_{d/2} = k,\quad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{d/2} \geq 1 $$

We consider solutions of the following form (interested readers can explore other forms; the degrees of freedom here are quite large):

$$ \lambda_1\lambda_2\cdots\lambda_m = \exp(am^b) $$

When $a > 0$ and $0\leq b\leq 1$, the condition $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{d/2} \geq 1$ is satisfied. When $b=1$, this is exactly the "NTK-RoPE-fixed" above; when $b=0$, it is Position Interpolation (PI). The requirement $\lambda_1\lambda_2\cdots\lambda_{d/2} = k$ gives the constraint

$$ a\left(\frac{d}{2}\right)^b = \log k $$

So there is only one degree of freedom left to tune. After a simple binary search, the author found that, in his experiments, $b=0.625$ gives relatively good extension results on average (different models may have different optima, so tune it for your own model). This version is called "NTK-RoPE-mixed".
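A minimal sketch of "NTK-RoPE-mixed" under the parameterization above (hypothetical function name; note that $b=1$ reproduces NTK-RoPE-fixed and $b=0$ reproduces PI, which is a handy sanity check):

```python
import numpy as np

def ntk_rope_mixed_theta(n, d, k, b=0.625, base=10000.0):
    # theta_m = n / (beta^(m-1) * lambda_1 * ... * lambda_m), with the cumulative
    # product lambda_1 * ... * lambda_m = exp(a * m^b) and a fixed by a * (d/2)^b = log k,
    # so that the total scaling over the d/2 digits is exactly k.
    beta = base ** (2.0 / d)
    m = np.arange(1, d // 2 + 1)
    a = np.log(k) / (d / 2) ** b
    cum_lambda = np.exp(a * m ** b)
    return n / (beta ** (m - 1) * cum_lambda)
```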

Experimental Results

Based on the experiments in 《Transformer Upgrade Path: 10. RoPE is a Base-$\beta$ Encoding》, the author added experiments for “NTK-RoPE-fixed” and “NTK-RoPE-mixed”. The comparison is as follows:

| Test Length | 512 (Train) | 4096 (Repeat) | 4096 (Non-repeat) |
|---|---|---|---|
| Baseline | 49.41% | 24.17% | 23.16% |
| Baseline-$\log n$ | 49.40% | 24.60% | 24.02% |
| PI-RoPE | 49.41% | 15.04% | 13.54% |
| PI-RoPE-$\log n$ | 49.40% | 14.99% | 16.51% |
| NTK-RoPE-old | 49.41% | 51.28% | 39.27% |
| NTK-RoPE-$\log n$-old | 49.40% | 61.71% | 43.75% |
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-$\log n$-fixed | 49.40% | 62.85% | 44.14% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-$\log n$-mixed | 49.40% | 68.91% | 45.41% |

As can be seen, compared with the uniform-base "NTK-RoPE-old" and "NTK-RoPE-fixed", the improvement brought by the mixed-radix "NTK-RoPE-mixed" is significant, and it requires no fine-tuning: truly a "free lunch". It is also clear that the $\log n$ versions extrapolate better, but the $\log n$ trick has to be added at the pre-training stage. Some readers have asked whether models such as LLaMA, which were pre-trained without the $\log n$ trick, can still enjoy some of the $\log n$ "dividend". After testing, the author found that results can be improved by adding the following scale factor:

$$ \max(1, \log_{\text{maxlen}} n)\tag{6} $$

Here, $\text{maxlen}$ is the maximum length during pre-training, which is 512 in this article's experiments, 2048 for LLaMA, and 4096 for LLaMA2. In implementation, one can simply multiply each $\boldsymbol{q}_n$ by the corresponding factor. This leaves the part within $\text{maxlen}$ unaffected while scaling the part beyond it by $\log_{\text{maxlen}} n$, giving a smooth transition. The results are as follows (a † is added to distinguish this from the original $\log n$):
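A sketch of equation $\text{(6)}$ as a per-position scale on the queries (hypothetical helper; positions are clamped at 1 before taking the logarithm):

```python
import numpy as np

def logn_scale(positions, maxlen):
    # Equation (6): max(1, log_maxlen(n)). Positions up to maxlen keep factor 1,
    # longer positions are scaled up smoothly; multiply each query q_n by this factor.
    n = np.maximum(np.asarray(positions, dtype=float), 1.0)  # guard against n = 0
    return np.maximum(1.0, np.log(n) / np.log(maxlen))

print(logn_scale([0, 256, 512, 4096], maxlen=512))  # [1.0, 1.0, 1.0, ~1.33]
```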

| Test Length | 512 (Train) | 4096 (Repeat) | 4096 (Non-repeat) |
|---|---|---|---|
| NTK-RoPE-fixed | 49.41% | 51.86% | 39.61% |
| NTK-RoPE-$\log n$†-fixed | 49.41% | 55.94% | 41.11% |
| NTK-RoPE-mixed | 49.41% | 53.09% | 40.12% |
| NTK-RoPE-$\log n$†-mixed | 49.41% | 59.11% | 42.38% |

As can be seen, this $\log n^{\dagger}$ can also be regarded as a free lunch. In summary: if you plan to pre-train from scratch, it is worth incorporating the $\log n$ trick in advance; if pre-training is already done, you can use equation $\text{(6)}$ as a substitute, and combine it with NTK-RoPE-mixed to better extend the Context length.
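Putting the pieces together, a rough sketch of this recipe (reusing the hypothetical `ntk_rope_mixed_theta` and `logn_scale` helpers from the earlier snippets) might look like this:

```python
import numpy as np

def extended_rope_angles_and_scales(positions, d, k, maxlen, b=0.625):
    # For each position n: NTK-RoPE-mixed angles for rotating q_n and k_n,
    # plus the log n factor of equation (6) to multiply into q_n.
    theta = np.stack([ntk_rope_mixed_theta(n, d, k, b=b) for n in positions])  # (len, d/2)
    scales = logn_scale(positions, maxlen)                                     # (len,)
    return theta, scales
```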

Summary

In this article, we revisited the base-$\beta$ perspective of RoPE and attempted to generalize NTK-aware Scaled RoPE. Inspired by mixed radix systems, we obtained a better strategy for extending Context length without fine-tuning, and finally demonstrated its effectiveness through experiments.

@online{kexuefm-9706,
        title={Transformer Upgrade Path: 11. Carrying the Base-$\beta$ Position to the End},
        author={苏剑林},
        year={2023},
        month={07},
        url={\url{https://kexue.fm/archives/9706}},
}