Readers who have followed the “Transformer Upgrade Path” series this far will already be familiar with Rotary Positional Embedding (RoPE). Simply put, RoPE applies a rotation transformation to the Attention Query ($\boldsymbol{Q}$) and Key ($\boldsymbol{K}$). Formally it is an absolute positional encoding, but combined with the Dot-Product structure of Attention it automatically achieves the effect of relative position.
So, can RoPE be applied to Value ($\boldsymbol{V}$)? It seems not, because after rotating $\boldsymbol{V}$, it would no longer be a relative positional encoding. However, things are not that absolute. This article will discuss applying RoPE to $\boldsymbol{V}$, which we can call the “second type of rotary positional embedding.”
Basic Review#
We decompose Dot-Product Attention as follows:
$$ \boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_j e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{k}_j $$For simplicity, the scaling factor for $s_{i,j}$ is omitted here. RoPE is applied to $\boldsymbol{q}_i, \boldsymbol{k}_j$:
$$ \boldsymbol{q}_i \to \boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i,\qquad \boldsymbol{k}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j $$This results in the Attention Logits, $s_{i,j}$, becoming
$$ s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j) = \boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{k}_j=\boldsymbol{q}_i^{\top}\boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{k}_j $$This means $s_{i,j}$ only depends on the relative position $j-i$, thereby achieving the effect of relative position through an absolute position form. This transformation process utilizes the property of rotation matrices: $\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j=\boldsymbol{\mathcal{R}}_{j-i}$.
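To make this concrete, here is a minimal numpy sketch (not from the original article; the helper name `rope`, the head size, and the positions are purely illustrative) that applies RoPE in its complex form to one query–key pair and checks numerically that the logit depends only on $j-i$:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply the rotation R_pos to a real vector x (even dim), in complex form."""
    z = x[0::2] + 1j * x[1::2]                           # pair adjacent dimensions
    theta = base ** (-2.0 * np.arange(z.size) / x.size)  # the usual RoPE frequencies
    z = z * np.exp(1j * theta * pos)                     # rotate pair k by theta_k * pos
    return np.stack([z.real, z.imag], -1).ravel()

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
i, j = 5, 12

s_abs = rope(q, i) @ rope(k, j)     # (R_i q)^T (R_j k): two absolute rotations
s_rel = q @ rope(k, j - i)          # q^T R_{j-i} k: the explicitly relative form
print(np.allclose(s_abs, s_rel))    # True
```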
Besides rotation matrices, in “The Transformer Upgrade Path: 4. Two-Dimensional Rotary Positional Embedding”, we proved that its general solution is $\boldsymbol{\mathcal{R}}_i = \boldsymbol{O}^i$, where $\boldsymbol{O}$ is an arbitrary orthogonal matrix, and the superscript denotes matrix exponentiation. However, later in “The Transformer Upgrade Path: 6. Completeness Analysis of Rotary Positional Embedding”, we also proved that the general orthogonal matrix solution is essentially isomorphic to the rotation matrix solution.
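As a quick numerical sanity check of that general solution (purely illustrative, not taken from those posts), one can verify that $\boldsymbol{\mathcal{R}}_i = \boldsymbol{O}^i$ with a random orthogonal $\boldsymbol{O}$ also satisfies $\boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j=\boldsymbol{\mathcal{R}}_{j-i}$:

```python
import numpy as np
from numpy.linalg import qr, matrix_power

# Take R_i = O^i for a random orthogonal O: since O^T = O^{-1}, we get
# R_i^T R_j = O^{-i} O^j = O^{j-i} = R_{j-i}, so the relative-position
# argument above still goes through.
rng = np.random.default_rng(1)
O, _ = qr(rng.normal(size=(8, 8)))   # random orthogonal matrix via QR decomposition
i, j = 4, 9
lhs = matrix_power(O, i).T @ matrix_power(O, j)
rhs = matrix_power(O, j - i)
print(np.allclose(lhs, rhs))         # True
```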
New Usage#
What if RoPE is applied to $\boldsymbol{v}_j$, i.e., $\boldsymbol{v}_j \to \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j$? Clearly, the Attention output would then be
$$ \boldsymbol{o}_i = \sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j $$This makes the Attention output depend explicitly on the absolute position $j$. If all we want is some form of positional encoding, this is perhaps not a big problem; but if we want a relative positional encoding, it does not serve our purpose.
However, a simple trick fixes this defect: apply an inverse RoPE to $\boldsymbol{o}_i$ afterwards:
$$ \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\left(\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\right)=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j $$Thus it becomes a relative positional encoding again! Its form is likewise a pair of absolute-position transformations, remarkably similar to the existing RoPE. We therefore call it the “second type of rotary positional encoding,” or more intuitively “VO-RoPE,” because it applies RoPE to both Value and Output. Correspondingly, the standard RoPE can be called “QK-RoPE.”
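The same style of numerical check works for VO-RoPE. The sketch below (again purely illustrative; `rope` is the same helper as in the earlier sketch, repeated so this snippet runs on its own) rotates each value by its absolute position, takes a weighted sum, counter-rotates with $\boldsymbol{\mathcal{R}}_i^{\top}$, and confirms that the result matches the explicitly relative form:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Complex-form RoPE helper (same as in the first sketch)."""
    z = x[0::2] + 1j * x[1::2]
    theta = base ** (-2.0 * np.arange(z.size) / x.size)
    z = z * np.exp(1j * theta * pos)
    return np.stack([z.real, z.imag], -1).ravel()

rng = np.random.default_rng(0)
n, i = 6, 3
V = rng.normal(size=(n, 8))                    # the values v_j
a = rng.random(n); a /= a.sum()                # some attention weights a_{i,j}

vo_rope  = rope(sum(a[j] * rope(V[j], j) for j in range(n)), -i)  # R_i^T (sum_j a_{ij} R_j v_j)
relative = sum(a[j] * rope(V[j], j - i) for j in range(n))        # sum_j a_{ij} R_{j-i} v_j
print(np.allclose(vo_rope, relative))          # True: only j - i matters again
```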
Simple Experiments#
A quick set of experiments was run on a LLaMA-like model with around 1B parameters, comparing the following configurations:
- NoPE: No positional encoding added at all.
- QK-RoPE: Standard rotary positional encoding.
- VO-RoPE: The second type of rotary positional encoding proposed in this article.
- Q/K/V/O-RoPE: Rotary positional encoding added to only one of Q, K, V, or O (four separate configurations).
- QKV-RoPE: Rotary positional encoding added to Q, K, and V.
- QKVO-RoPE: Rotary positional encoding added to Q, K, V, and O.
Note that configurations 4 and 5 count as absolute positional encodings. The approximate conclusion is:
$$ \text{QK-RoPE}\approx \text{QKVO-RoPE} > \text{K-RoPE}\approx \text{VO-RoPE} > \text{QKV-RoPE} > \text{NoPE} > \text{Q/V/O-RoPE} $$The specific loss values are:
$$ \begin{array}{c|c} \hline & \text{Loss} \\ \hline \text{QK-RoPE} & 2.712 \\ \text{QKVO-RoPE} & 2.719 \\ \text{K-RoPE} & 2.769 \\ \text{VO-RoPE} & 2.770 \\ \text{QKV-RoPE} & 2.783 \\ \text{NoPE} & 2.795 \\ \text{O-RoPE} & 2.841 \\ \text{Q-RoPE} & 2.851 \\ \text{V-RoPE} & 2.856 \\ \hline \end{array} $$
Some Thoughts#
From the results above, it can be seen that VO-RoPE is better than NoPE but not as good as QK-RoPE, and combining VO-RoPE and QK-RoPE does not yield any improvement. In this case, does VO-RoPE seem unnecessary?
In my opinion, rounding out how RoPE can be used, answering the question “Can RoPE be applied to Value?”, and then establishing experimentally that “there is no benefit” is itself valuable. Moreover, in the long run it may not always be without benefit; it simply shows no significant effect under current mainstream language-model settings. When I first proposed RoPE, the motivation was purely for fun, and I did not expect it to become a competitive positional encoding (what happened afterwards was a pleasant surprise).
As for the current situation, VO-RoPE does have a potential application scenario related to MLA, introduced in “The Trade-off Between Cache and Effect: From MHA, MQA, GQA to MLA”. We know that MLA is roughly equivalent to an MQA with shared K and V during inference:
$$ \boldsymbol{o}_i = \sum_{j=1}^i a_{i,j}\boldsymbol{c}_j,\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = \boldsymbol{q}_i^{\top}\boldsymbol{c}_j $$This characteristic allows its KV Cache to contain only $\boldsymbol{c}$. However, this important characteristic is not fully compatible with QK-RoPE, because once we want to add RoPE to $\boldsymbol{c}_j$ inside the Attention matrix, there are two possible outcomes:
- RoPE is not added to $\boldsymbol{c}_j$ on the Value side, in which case K and V are not fully shared. This leads to either doubling the KV Cache (requiring caching both before and after RoPE) or injecting RoPE into K in real-time (causing latency).
- If RoPE is added to $\boldsymbol{c}_j$ on the Value side, the effect of K and V sharing can be achieved, but at this point, it is no longer a relative positional encoding.
To solve this problem, MLA adopts a concatenation scheme of “mostly NoPE + a small RoPE part.” However, with the second type of rotary positional encoding discussed in this article, all that is needed is to additionally apply the inverse rotation (O-RoPE) to the Output:
$$ \boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\sum_{j=1}^i a_{i,j}(\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j),\qquad a_{i,j} = \frac{e^{s_{i,j}}}{\sum\limits_{j=1}^i e^{s_{i,j}}},\qquad s_{i,j} = (\boldsymbol{\mathcal{R}}_i\boldsymbol{q}_i)^{\top} (\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j) $$However, this idea is not fully worked out yet and cannot be directly applied to the training form of MLA. I am writing it down first for everyone’s reference.
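To illustrate the inference-time view implied by the formula above, here is a hedged sketch of the shared-KV (MQA-like) decoding loop: each position caches only the single rotated vector $\boldsymbol{\mathcal{R}}_j\boldsymbol{c}_j$, reuses it as both Key and Value, and counter-rotates the output by $\boldsymbol{\mathcal{R}}_i^{\top}$. The function names and random inputs are made up for illustration; this is not MLA's actual implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Complex-form RoPE helper (same as in the earlier sketches)."""
    z = x[0::2] + 1j * x[1::2]
    theta = base ** (-2.0 * np.arange(z.size) / x.size)
    z = z * np.exp(1j * theta * pos)
    return np.stack([z.real, z.imag], -1).ravel()

def decode_step(q_i, cache, i):
    """One causal decoding step; `cache` holds the already-rotated R_j c_j."""
    q_rot = rope(q_i, i)
    s = np.array([q_rot @ c_rot for c_rot in cache])        # s_{i,j} = (R_i q_i)^T (R_j c_j)
    a = np.exp(s - s.max()); a /= a.sum()                   # softmax over j <= i
    o = sum(a_j * c_rot for a_j, c_rot in zip(a, cache))    # the values are the same cached vectors
    return rope(o, -i)                                      # o_i = R_i^T (sum_j a_{ij} R_j c_j)

rng, dim = np.random.default_rng(0), 8
cache = []
for i in range(5):
    cache.append(rope(rng.normal(size=dim), i))   # cache a single rotated c_i per position
    o_i = decode_step(rng.normal(size=dim), cache, i)
```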
Related Work#
In fact, VO-RoPE also neatly provides an intermediate form between Attention and complex-valued linear RNNs (such as LRU and RetNet). Starting from the equation $\boldsymbol{o}_i = \boldsymbol{\mathcal{R}}_i^{\top}\left(\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j\right)=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_i^{\top}\boldsymbol{\mathcal{R}}_j\boldsymbol{v}_j=\sum_j a_{i,j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j$, considering the Causal scenario and taking the special case $a_{i,j}=\gamma^{i-j}$ with $0 < \gamma < 1$, we get
$$ \boldsymbol{o}_i = \sum_{j=1}^i \gamma^{i-j} \boldsymbol{\mathcal{R}}_{j-i}\boldsymbol{v}_j $$We know that the rotation matrix $\boldsymbol{\mathcal{R}}_{j-i}$, written in complex form, is just a diagonal matrix with entries $e^{\mathbb{I}\theta (j - i)}$, where $\mathbb{I}$ is the imaginary unit (i.e., $\mathbb{I}^2=-1$); $\mathbb{I}$ is used here to avoid confusion with the index $i$. Thus the above equation is equivalent to
$$ \boldsymbol{o}_i = \sum_{j=1}^i \gamma^{i-j} e^{\mathbb{I}\theta (j - i)} \boldsymbol{v}_j = \sum_{j=1}^i (\gamma e^{-\mathbb{I}\theta})^{i-j} \boldsymbol{v}_j $$This is essentially the simplest linear RNN with complex decay. From the derivation in “Google’s New Work Tries to ‘Revive’ RNNs: Can RNNs Shine Again?”, this type of RNN is theoretically more complete than RNNs with purely real decay.
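As a quick check of this equivalence (with arbitrary illustrative values for $\gamma$, $\theta$, and the sequence length, and a single complex channel standing in for one 2D pair of $\boldsymbol{v}_j$), the decayed sum can be compared against the one-step recurrence $\boldsymbol{o}_i = \gamma e^{-\mathbb{I}\theta}\,\boldsymbol{o}_{i-1} + \boldsymbol{v}_i$:

```python
import numpy as np

# Check that the decayed VO-RoPE sum equals the one-line complex linear RNN
# recurrence o_i = (gamma * e^{-I*theta}) * o_{i-1} + v_i.
gamma, theta, n = 0.9, 0.3, 7
decay = gamma * np.exp(-1j * theta)
rng = np.random.default_rng(0)
v = rng.normal(size=n) + 1j * rng.normal(size=n)

# direct form: o_i = sum_{j <= i} (gamma e^{-I theta})^{i-j} v_j
o_sum = [sum(decay ** (i - j) * v[j] for j in range(i + 1)) for i in range(n)]

# recurrent form: o_i = decay * o_{i-1} + v_i
o_rnn, prev = [], 0.0
for i in range(n):
    prev = decay * prev + v[i]
    o_rnn.append(prev)

print(np.allclose(o_sum, o_rnn))   # True
```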
Therefore, adding the VO-RoPE form is equivalent to generalizing from real linear RNNs to complex linear RNNs. In theory this makes the model more complete, although the added completeness does not necessarily help language modeling, just as the complex-valued LRU showed no advantage over the purely real RWKV. Still, theoretical completeness may carry special value in certain scenarios, who knows~
Summary#
This article revolves around the question “Can RoPE be applied to V?” and discusses the second type of RoPE usage.
@online{kexuefm-10862,
title={The Transformer Upgrade Path: 19. The Second Type of Rotary Positional Embedding},
author={苏剑林},
year={2025},
month={04},
url={\url{https://kexue.fm/archives/10862}},
}