In the previous article “The Transformer Upgradation Path: 2. RoPE: Rotary Position Embedding”, we proposed the Rotary Position Embedding (RoPE) and the corresponding Transformer model, RoFormer. Since my main research area is NLP, I initially considered this topic finished for myself. However, recently, Transformer models have also become very popular in the field of computer vision, with various Vision Transformers (ViT) emerging one after another. This led to a question: What should 2D RoPE look like?
At first glance, this might seem like a simple generalization of the 1D case, but the derivations and understanding involved are far more complex than we might imagine. This article analyzes this to deepen our understanding of RoPE.
2D RoPE#
What is a 2D position? What does the corresponding 2D RoPE look like? Where does the difficulty lie? In this section, we will first briefly introduce 2D positions and then directly present the result and derivation idea for 2D RoPE. In the following sections, we will provide the detailed derivation process.
2D Positions#
In NLP, the position information of language is 1D; in other words, we need to tell the model which word in the sentence this word is. However, in CV, image position information is 2D, meaning we need to tell the model which row and which column a feature is located in. The “2D” here refers to the fact that two numbers are needed to fully describe the position information, not the dimensionality of the position vector.
Some readers might think: Can’t we just flatten it and treat it as 1D? That doesn’t quite work. For example, on an $h \times h$ feature map, the position $(x,y)$ becomes $xh + y$ after flattening, while positions $(x+1,y)$ and $(x,y+1)$ become $xh+y+h$ and $xh+y+1$, respectively. The differences from $xh + y$ are $h$ and $1$. However, according to our intuitive understanding, the distance between $(x+1,y)$ and $(x,y)$ and the distance between $(x,y+1)$ and $(x,y)$ should be the same. Getting different values $h$ and $1$ after flattening seems unreasonable.
Therefore, we need to specially design position encodings for the 2D case and cannot simply flatten it into 1D.
Standard Answer#
After the following derivations, one solution for 2D RoPE is found to be:
$$ \boldsymbol{\mathcal{R}}_{x,y}=\left( \begin{array}{cc:cc} \cos x\theta & -\sin x\theta & 0 & 0 \\ \sin x\theta & \cos x\theta & 0 & 0 \\ \hdashline 0 & 0 & \cos y\theta & -\sin y\theta \\ 0 & 0 & \sin y\theta & \cos y\theta \\ \end{array}\right) $$This solution is easy to understand. It is a block matrix composed of two 1D RoPEs. In implementation, it splits the input vector into two halves, applying the 1D RoPE for $x$ to one half and the 1D RoPE for $y$ to the other half. From this form, it is easy to draw an analogy to RoPE for 3D, 4D, and other positions.
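To make the implementation concrete, here is a minimal numpy sketch of this “split into two halves” scheme (my own illustration, not code from the original work; helper names such as `rope_2d` are hypothetical). It handles a single 4-dimensional vector with one frequency $\theta$; in practice the same pattern would be tiled along the head dimension with a set of different $\theta$ values, just as in 1D RoPE.

```python
import numpy as np

def rot(t):
    # 2x2 rotation matrix, i.e. the 1D RoPE block for angle t
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def rope_2d(v, x, y, theta=1.0):
    # Rotate the first half of v by x*theta and the second half by y*theta
    # (minimal 4-dimensional case).
    v = np.asarray(v, dtype=float)
    h = len(v) // 2
    return np.concatenate([rot(x * theta) @ v[:h],
                           rot(y * theta) @ v[h:]])

q = np.array([1.0, 2.0, 3.0, 4.0])
print(rope_2d(q, x=3, y=5))
```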
The matrix above is an orthogonal matrix, satisfying two key properties:
Relativity: $\boldsymbol{\mathcal{R}}_{x_1,y_1}^{\top}\boldsymbol{\mathcal{R}}_{x_2,y_2}=\boldsymbol{\mathcal{R}}_{x_2-x_1,y_2-y_1}$. It is precisely because of this property that RoPE can achieve relative position encoding through absolute positions.
Invertibility: Given $\boldsymbol{\mathcal{R}}_{x,y}$, $x,y$ can be uniquely solved. This means that the encoding of position information is lossless.
In a sense, the equation above is the simplest solution that satisfies these two properties. While slightly different solutions might exist that satisfy these properties, they are relatively more complex in form and implementation.
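Both properties are straightforward to check numerically. Below is a small numpy sketch (my own addition; the helper names are hypothetical) that builds the block matrix above, verifies relativity, and reads $(x,y)$ back off the matrix entries; the recovery is unique as long as $x\theta$ and $y\theta$ stay within $(-\pi,\pi]$.

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def R(x, y, theta=1.0):
    # Block-diagonal 2D RoPE matrix from the equation above
    M = np.zeros((4, 4))
    M[:2, :2] = rot(x * theta)
    M[2:, 2:] = rot(y * theta)
    return M

def recover(M, theta=1.0):
    # Invertibility: read x*theta and y*theta back off the two rotation blocks
    x = np.arctan2(M[1, 0], M[0, 0]) / theta
    y = np.arctan2(M[3, 2], M[2, 2]) / theta
    return x, y

# Relativity: R_{x1,y1}^T R_{x2,y2} == R_{x2-x1, y2-y1}
print(np.allclose(R(1.0, 2.0).T @ R(2.5, 0.5), R(1.5, -1.5)))  # True

# Invertibility: (x, y) can be read back from the matrix
print(recover(R(1.2, -0.7)))  # approximately (1.2, -0.7)
```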
Derivation Idea#
In hindsight, RoPE essentially found the matrix $\boldsymbol{\mathcal{R}}_n=\begin{pmatrix}\cos n\theta & -\sin n\theta\\ \sin n\theta & \cos n\theta\end{pmatrix}$ such that it satisfies the “relativity” condition:
$$ \boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=\boldsymbol{\mathcal{R}}_{n-m} $$Therefore, it’s natural to think that the basic requirement for 2D RoPE is also to satisfy relativity, i.e., finding a matrix $\boldsymbol{\mathcal{R}}_{x,y}$ such that it satisfies the 2D relativity condition $\boldsymbol{\mathcal{R}}_{x_1,y_1}^{\top}\boldsymbol{\mathcal{R}}_{x_2,y_2}=\boldsymbol{\mathcal{R}}_{x_2-x_1,y_2-y_1}$. However, if this were the only requirement, there would be many feasible solutions, such as simply letting
$$ \boldsymbol{\mathcal{R}}_{x,y} = \begin{pmatrix}\cos (x+y)\theta & -\sin (x+y)\theta\\ \sin (x+y)\theta & \cos (x+y)\theta\end{pmatrix} $$But the problem with this solution is that we cannot uniquely derive $(x,y)$ from $x+y$. This means this choice is lossy for position information. Therefore, we need an additional “invertibility” property to ensure that the original position signals can be losslessly reconstructed from the position matrix.
To achieve this, we have two relatively natural avenues to choose from: 1. Quaternions; 2. Matrix exponentials. We will introduce them one by one in the following sections.
Quaternions#
In the derivation of 1D RoPE, we primarily used complex numbers as a tool. Quaternions are a generalization of complex numbers and retain many of their properties, so using them to derive 2D RoPE is also a natural idea. Unfortunately, it turns out to be a dead end, but I will still include the thought process here for reference.
Complex Numbers and Matrices#
In high school, we learned that a complex number $a+b\boldsymbol{i}$ corresponds one-to-one with a 2D vector $(a,b)$ (I’ve bolded the imaginary unit $\boldsymbol{i}$ here to align with quaternions later). However, this correspondence only holds for addition and subtraction (because vectors don’t have a universal multiplication operation). A more elegant correspondence is mapping complex numbers to matrices:
$$ a+b\boldsymbol{i} \quad \leftrightarrow \quad \begin{pmatrix} a & -b \\ b & a \end{pmatrix} $$Under this mapping, the addition, subtraction, multiplication, and division of complex numbers correspond one-to-one with matrix addition, subtraction, multiplication, and division. For example:
$$ \begin{array}{ccc} (a+b\boldsymbol{i})(c+d\boldsymbol{i}) & = & (ac - bd) + (ad + bc)\boldsymbol{i} \\[5pt] \begin{pmatrix} a & -b \\ b & a \end{pmatrix}\begin{pmatrix} c & -d \\ d & c \end{pmatrix} & = & \begin{pmatrix} ac - bd & - ad - bc \\ ad + bc & ac - bd \end{pmatrix} \end{array} $$Thus, the matrix mapping is a complete isomorphism of the complex domain, while the vector mapping is merely an intuitive geometric understanding.
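As a quick numerical sanity check (my own addition), one can verify this isomorphism for multiplication directly:

```python
import numpy as np

def as_matrix(z):
    # Map the complex number a + b*i to the matrix [[a, -b], [b, a]]
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

z1, z2 = 2 + 3j, -1 + 4j
# Multiplying complex numbers matches multiplying their matrix images
print(np.allclose(as_matrix(z1 * z2), as_matrix(z1) @ as_matrix(z2)))  # True
```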
The matrix mapping of complex numbers is also an important basis for RoPE. In “The Transformer Upgradation Path: 2. RoPE: Rotary Position Embedding”, we derived that the complex representation of RoPE is $\boldsymbol{q}e^{n\boldsymbol{i}\theta}=(\cos n\theta + \boldsymbol{i}\sin n\theta)\boldsymbol{q}$. Therefore, according to the matrix mapping of complex numbers, $\cos n\theta + \boldsymbol{i}\sin n\theta$ corresponds to the matrix
$$ \boldsymbol{\mathcal{R}}_n=\begin{pmatrix}\cos n\theta & -\sin n\theta\\ \sin n\theta & \cos n\theta\end{pmatrix} $$thereby obtaining the matrix form of 1D RoPE.
Quaternion Introduction#
As mentioned earlier, quaternions are a generalization of complex numbers. In fact, they are also the “ancestors” of matrices. Historically, quaternions came before general matrix operations, and they inspired many matrix operations. Years ago, I also wrote articles like “Deep Roots of Quaternions in Vectors” and “Geometry of Numbers and Numbers of Geometry: A Brief Exploration of Hypercomplex Numbers” to introduce quaternions, which readers are welcome to refer to.
If complex numbers are 2D vectors, then quaternions are 4D vectors, represented as $a+b\boldsymbol{i}+c\boldsymbol{j}+d\boldsymbol{k}$, where $\boldsymbol{i}^2=\boldsymbol{j}^2=\boldsymbol{k}^2=-1$, but they are all distinct. The operation rules between the bases are:
| $\times$ | $1$ | $\boldsymbol{i}$ | $\boldsymbol{j}$ | $\boldsymbol{k}$ |
|---|---|---|---|---|
| $1$ | $1$ | $\boldsymbol{i}$ | $\boldsymbol{j}$ | $\boldsymbol{k}$ |
| $\boldsymbol{i}$ | $\boldsymbol{i}$ | $-1$ | $\boldsymbol{k}$ | $-\boldsymbol{j}$ |
| $\boldsymbol{j}$ | $\boldsymbol{j}$ | $-\boldsymbol{k}$ | $-1$ | $\boldsymbol{i}$ |
| $\boldsymbol{k}$ | $\boldsymbol{k}$ | $\boldsymbol{j}$ | $-\boldsymbol{i}$ | $-1$ |
At the time, what struck people most was their non-commutativity, for example $\boldsymbol{i}\boldsymbol{j}=-\boldsymbol{j}\boldsymbol{i}\neq \boldsymbol{j}\boldsymbol{i}$. Apart from this, their operations are actually highly similar to those of complex numbers.
For example, there is an analogue of Euler’s formula for complex numbers:
$$ e^{a+b\boldsymbol{i}+c\boldsymbol{j}+d\boldsymbol{k}} = e^a\left(\cos r + \frac{b\boldsymbol{i}+c\boldsymbol{j}+d\boldsymbol{k}}{r}\sin r\right) $$Here $r = \Vert b\boldsymbol{i}+c\boldsymbol{j}+d\boldsymbol{k}\Vert = \sqrt{b^2+c^2+d^2}$. Furthermore, there is a similar matrix mapping:
$$ a+b\boldsymbol{i}+c\boldsymbol{j}+d\boldsymbol{k} \quad \leftrightarrow \quad \begin{pmatrix} a & -b & -c & -d \\ b & a & -d & c \\ c & d & a & -b \\ d & -c & b & a \end{pmatrix} $$
Violation of Relativity#
Regarding the origins of these formulas, that is a long story which I won’t go into here; interested readers can look up the references on their own. With Euler’s formula and the matrix mapping in hand, readers might realize: 1D RoPE is simply the matrix mapping corresponding to $e^{n\boldsymbol{i}\theta}$. So for 2D RoPE, couldn’t we just map $e^{x\boldsymbol{i}\theta + y\boldsymbol{j}\theta}$ to a matrix form?
I initially thought this too, but unfortunately, it is incorrect. Where is the mistake? In 1D RoPE, we used the complex representation of the inner product:
$$ \langle\boldsymbol{q},\boldsymbol{k}\rangle=\text{Re}[\boldsymbol{q}\boldsymbol{k}^*] $$This identity also holds for quaternions, so it can be directly applied. Then we used the complex exponential:
$$ \langle\boldsymbol{q}e^{m\boldsymbol{i}\theta},\boldsymbol{k}e^{n\boldsymbol{i}\theta}\rangle=\text{Re}\left[\left(\boldsymbol{q}e^{m\boldsymbol{i}\theta}\right)\left(\boldsymbol{k}e^{n\boldsymbol{i}\theta}\right)^*\right]=\text{Re}\left[\boldsymbol{q}e^{m\boldsymbol{i}\theta}e^{-n\boldsymbol{i}\theta}\boldsymbol{k}^*\right]=\text{Re}\left[\boldsymbol{q}e^{(m-n)\boldsymbol{i}\theta}\boldsymbol{k}^*\right] $$The first two equalities can be carried over to quaternions, but the key is that the third equality is not always true for quaternions — in particular, it fails once $e^{n\boldsymbol{i}\theta}$ is replaced by a mixed exponent such as $e^{x\boldsymbol{i}\theta+y\boldsymbol{j}\theta}$! In general, for two quaternions $\boldsymbol{p},\boldsymbol{q}$, the equality $e^{\boldsymbol{p}+\boldsymbol{q}}=e^{\boldsymbol{p}}e^{\boldsymbol{q}}$ does not hold! More broadly, for two objects whose multiplication does not satisfy the commutative law, generally $e^{\boldsymbol{p}+\boldsymbol{q}}\neq e^{\boldsymbol{p}}e^{\boldsymbol{q}}$.
Therefore, in the end, because exponential multiplication cannot be converted to addition, the final relativity property cannot be guaranteed. Thus, the path through quaternion derivation is terminated…
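For readers who want a concrete demonstration of this failure, here is a small numerical check (my own addition) using the $4\times 4$ matrix images of $\boldsymbol{i}$ and $\boldsymbol{j}$ from the mapping above, together with the matrix exponential (introduced formally in the next section) via scipy’s `expm`:

```python
import numpy as np
from scipy.linalg import expm

# 4x4 matrix images of the quaternion units i and j under the mapping above
I4 = np.array([[0, -1,  0,  0],
               [1,  0,  0,  0],
               [0,  0,  0, -1],
               [0,  0,  1,  0]], dtype=float)
J4 = np.array([[0,  0, -1,  0],
               [0,  0,  0,  1],
               [1,  0,  0,  0],
               [0, -1,  0,  0]], dtype=float)

theta = 0.5
P, Q = 2 * theta * I4, 3 * theta * J4  # images of x*i*theta and y*j*theta, with x=2, y=3

print(np.allclose(I4 @ J4, J4 @ I4))                # False: i and j do not commute
print(np.allclose(expm(P + Q), expm(P) @ expm(Q)))  # False: exp(p+q) != exp(p)exp(q)
```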
Matrix Exponential#
The matrix mapping of quaternions shows that quaternions actually represent a specific class of $4\times 4$ matrices. If the quaternion derivation doesn’t work, perhaps a general matrix analysis can. Indeed, this is the case. In this section, we will use the matrix exponential to provide a derivation result.
Matrix Exponential#
The matrix exponential here is not the element-wise exponential familiar as an activation function in neural networks, but an operation defined by a power series:
$$ \exp \boldsymbol{B} = \sum_{k=0}^{\infty}\frac{\boldsymbol{B}^k}{k!} $$where $\boldsymbol{B}^k$ refers to multiplying $k$ copies of $\boldsymbol{B}$ together using matrix multiplication. Regarding the matrix exponential, I previously wrote “Appreciating the Identity det(exp(A)) = exp(Tr(A))”, which is also welcome for reference.
The matrix exponential is a very important matrix operation. It can directly provide the solution to the constant coefficient differential equation system $\frac{d}{dt}\boldsymbol{x}_t=\boldsymbol{A}\boldsymbol{x}_t$:
$$ \boldsymbol{x}_t = \big(\exp t\boldsymbol{A}\big)\boldsymbol{x}_0 $$Of course, this is not very relevant to the main topic of this article. For the derivation of RoPE, we primarily use the following property of the matrix exponential:
$$ \boldsymbol{A}\boldsymbol{B} = \boldsymbol{B}\boldsymbol{A} \quad\Rightarrow\quad \big(\exp \boldsymbol{A}\big)\big(\exp \boldsymbol{B}\big) = \exp \big(\boldsymbol{A} + \boldsymbol{B}\big) $$This means that if the multiplication of $\boldsymbol{A},\boldsymbol{B}$ is commutative, then the matrix exponential can convert multiplication into addition, just like the exponential of numbers. However, note that this is a sufficient but not necessary condition.
As for how to calculate the matrix exponential, I won’t go into detail here. Many software libraries already provide it: for numerical computation, scipy and tensorflow both offer an `expm` function, and for symbolic computation, Mathematica has the `MatrixExp` function.
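As a small usage example (my own addition), the property above is easy to check with scipy’s `expm`:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A, B = 2.0 * M, -0.5 * M         # multiples of the same matrix, so A and B commute
C = rng.standard_normal((3, 3))  # a generic matrix that does not commute with A

print(np.allclose(expm(A) @ expm(B), expm(A + B)))  # True: commuting case
print(np.allclose(expm(A) @ expm(C), expm(A + C)))  # False (generically) for non-commuting A, C
```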
1D General Solution#
Why can RoPE be related to the matrix exponential? Because the 1D RoPE has a relatively simple exponential expression:
$$ \boldsymbol{\mathcal{R}}_n=\begin{pmatrix}\cos n\theta & -\sin n\theta\\ \sin n\theta & \cos n\theta\end{pmatrix}=\exp\left\{n\theta\begin{pmatrix}0 & -1\\ 1 & 0\end{pmatrix}\right\} $$So I started considering matrices of the form
$$ \boldsymbol{\mathcal{R}}_n=\exp n\boldsymbol{B} $$as a solution for RoPE, where $\boldsymbol{B}$ is a matrix independent of $n$. A necessary condition for RoPE is to satisfy the “relativity” condition. So we analyze
$$ \big(\exp m\boldsymbol{B}\big)^{\top}\big(\exp n\boldsymbol{B}\big) = \big(\exp m\boldsymbol{B}^{\top}\big)\big(\exp n\boldsymbol{B}\big) $$Here, let’s first assume that $\boldsymbol{B}^{\top}$ and $\boldsymbol{B}$ are commutative. Then according to the equation above we have
$$ \big(\exp m\boldsymbol{B}^{\top}\big)\big(\exp n\boldsymbol{B}\big) = \exp \big(m\boldsymbol{B}^{\top} + n\boldsymbol{B}\big) $$To make $m\boldsymbol{B}^{\top} + n\boldsymbol{B}=(n-m)\boldsymbol{B}$, we only need to satisfy
$$ \boldsymbol{B}^{\top} = - \boldsymbol{B} $$This is the constraint condition given by “relativity”. We also assumed that $\boldsymbol{B}^{\top}$ and $\boldsymbol{B}$ are commutative. Now we can verify that if this equation is satisfied, $\boldsymbol{B}^{\top}$ and $\boldsymbol{B}$ are indeed commutative, so the result is self-consistent.
This means that for any matrix $\boldsymbol{B}$ satisfying $\boldsymbol{B}^{\top} + \boldsymbol{B} = 0$, $\exp n\boldsymbol{B}$ is a solution to the equation above, and it can also be proven to be an orthogonal matrix. Of course, from $\exp n\boldsymbol{B}=\left(\exp \boldsymbol{B}\right)^n$, we more directly get that for any orthogonal matrix $\boldsymbol{O}$, $\boldsymbol{\mathcal{R}}_n=\boldsymbol{O}^n$ is a solution to the equation above.
For $2\times 2$ matrices, the general solution for $\boldsymbol{B}^{\top} + \boldsymbol{B} = 0$ is $\boldsymbol{B}=\begin{pmatrix}0 & -\theta\\ \theta & 0\end{pmatrix}$, which leads to the solution shown in the equation above.
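A quick numerical confirmation of this $2\times 2$ case (my own addition), again using scipy’s `expm`:

```python
import numpy as np
from scipy.linalg import expm

theta = 0.3
B = np.array([[0.0, -theta],
              [theta, 0.0]])  # skew-symmetric: B^T = -B

def R(n):
    return expm(n * B)

n, m = 7, 2
# exp(nB) reproduces the 1D RoPE rotation matrix ...
print(np.allclose(R(n), [[np.cos(n * theta), -np.sin(n * theta)],
                         [np.sin(n * theta),  np.cos(n * theta)]]))  # True
# ... it is orthogonal, and it satisfies relativity R_m^T R_n = R_{n-m}
print(np.allclose(R(n).T @ R(n), np.eye(2)))  # True
print(np.allclose(R(m).T @ R(n), R(n - m)))   # True
```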
2D Constraints#
Similarly, for 2D RoPE, we consider
$$ \boldsymbol{\mathcal{R}}_{x,y}=\exp \big(x\boldsymbol{B}_1 + y\boldsymbol{B}_2\big) $$as a candidate solution. Repeating the derivation for the “relativity” condition: First, assume that $x_1\boldsymbol{B}_1^{\top} + y_1\boldsymbol{B}_2^{\top}$ and $x_2\boldsymbol{B}_1 + y_2\boldsymbol{B}_2$ are commutative. Then we can obtain the following constraint conditions:
$$ \left\{\begin{aligned} &\boldsymbol{B}_1^{\top} + \boldsymbol{B}_1 = 0\\ &\boldsymbol{B}_2^{\top} + \boldsymbol{B}_2 = 0\\ &\boldsymbol{B}_1 \boldsymbol{B}_2^{\top} = \boldsymbol{B}_2^{\top} \boldsymbol{B}_1 \end{aligned}\right. $$It is easy to prove that under the first two conditions, the new constraint condition is equivalent to $\boldsymbol{B}_1 \boldsymbol{B}_2 = \boldsymbol{B}_2 \boldsymbol{B}_1$ (just substitute $\boldsymbol{B}_2^{\top} = -\boldsymbol{B}_2$ into the third condition).
RoPE Emerges#
A $2\times 2$ matrix satisfying the first two conditions has only one independent parameter, so $\boldsymbol{B}_1$ and $\boldsymbol{B}_2$ would be proportional to each other and only a single linear combination of $x$ and $y$ could be recovered, violating “invertibility”. We must therefore consider at least $3\times 3$ matrices, which have 3 independent parameters:
$$ \begin{pmatrix}0 & -a & -b \\ a & 0 & -c \\ b & c & 0\end{pmatrix} $$To ensure invertibility, we might assume $\boldsymbol{B}_1,\boldsymbol{B}_2$ are “orthogonal”. For example, let:
$$ \boldsymbol{B}_1=\begin{pmatrix}0 & -a & 0 \\ a & 0 & 0 \\ 0 & 0 & 0\end{pmatrix},\quad\boldsymbol{B}_2=\begin{pmatrix}0 & 0 & -b \\ 0 & 0 & -c \\ b & c & 0\end{pmatrix} $$Without loss of generality, we can set $a=1$. Then from the conditions above, we solve for $b=0,c=0$, meaning $\boldsymbol{B}_2$ must be the zero matrix, which doesn’t meet our requirements. The Mathematica code for solving is:
B[a_, b_, c_] = {{0, -a, -b}, {a, 0, -c}, {b, c, 0}};
B1 = B[1, 0, 0];
B2 = B[0, b, c];
Solve[{Dot[B1, B2] == Dot[B2, B1]}, {b, c}]
Therefore, we must at least consider $4\times 4$ matrices. They have 6 independent parameters. Without loss of generality, consider orthogonal decomposition:
$$ \boldsymbol{B}_1=\begin{pmatrix}0 & -a & -b & 0 \\ a & 0 & -c & 0 \\ b & c & 0 & 0 \\ 0 & 0 & 0 & 0\end{pmatrix},\quad\boldsymbol{B}_2=\begin{pmatrix}0 & 0 & 0 & -d \\ 0 & 0 & 0 & -e \\ 0 & 0 & 0 & -f \\ d & e & f & 0\end{pmatrix} $$Solving gives:
$$ d=cf,\quad e=-bf $$The solving code:
B[a_, b_, c_, d_, e_, f_] = {{0, -a, -b, -d}, {a, 0, -c, -e}, {b, c, 0, -f}, {d, e, f, 0}};
B1 = B[1, b, c, 0, 0, 0];
B2 = B[0, 0, 0, d, e, f];
Solve[{Dot[B1, B2] == Dot[B2, B1]}, {b, c, d, e, f}]
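As a cross-check of this result (my own addition), one can confirm numerically that picking arbitrary $b,c,f$ and setting $d=cf$, $e=-bf$ indeed makes $\boldsymbol{B}_1$ and $\boldsymbol{B}_2$ commute:

```python
import numpy as np

def B(a, b, c, d, e, f):
    # Same parameterization of a 4x4 skew-symmetric matrix as the Mathematica code above
    return np.array([[0, -a, -b, -d],
                     [a,  0, -c, -e],
                     [b,  c,  0, -f],
                     [d,  e,  f,  0]], dtype=float)

b, c, f = 0.7, -1.3, 2.1           # arbitrary values
B1 = B(1, b, c, 0, 0, 0)
B2 = B(0, 0, 0, c * f, -b * f, f)  # d = c*f, e = -b*f
print(np.allclose(B1 @ B2, B2 @ B1))  # True
```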
It can be seen that the result does not impose a constraint on $f$. So for simplicity, we can set $f=1$ and the remaining $b,c,d,e$ all to 0. In this case,
$$ \boldsymbol{\mathcal{R}}_{x,y}=\exp \,\begin{pmatrix}0 & -x & 0 & 0 \\ x & 0 & 0 & 0 \\ 0 & 0 & 0 & -y \\ 0 & 0 & y & 0\end{pmatrix} $$We can add a parameter $\theta$ and expand it to get:
$$ \boldsymbol{\mathcal{R}}_{x,y}=\exp \,\left\{\begin{pmatrix}0 & -x & 0 & 0 \\ x & 0 & 0 & 0 \\ 0 & 0 & 0 & -y \\ 0 & 0 & y & 0\end{pmatrix}\theta\right\}=\left( \begin{array}{cc:cc} \cos x\theta & -\sin x\theta & 0 & 0 \\ \sin x\theta & \cos x\theta & 0 & 0 \\ \hdashline 0 & 0 & \cos y\theta & -\sin y\theta \\ 0 & 0 & \sin y\theta & \cos y\theta \\ \end{array}\right) $$
Extended Story#
So far, the derivation of 2D RoPE has been introduced. Readers might now wonder about its effectiveness. Unfortunately, there are no complete experimental results yet. After all, I haven’t done ViT-related work before, and the derivation of this 2D RoPE was just completed not long ago, so progress is relatively slow. All I can say is that preliminary results show it is quite effective. Members of the EleutherAI team have also experimented with this approach, and the results are also better than other existing position encodings.
Speaking of the EleutherAI team, let me add a few more words. EleutherAI is the team that recently gained attention for its effort to reproduce GPT-3. After we proposed RoPE and RoFormer in the article “The Transformer Upgradation Path: 2. RoPE: Rotary Position Embedding”, we were fortunate to gain the attention of the EleutherAI team. They conducted many supplementary experiments and confirmed that RoPE is more effective than many other position encodings (see their blog post “Rotary Embeddings: A Relative Revolution”). This prompted us to complete the English paper “RoFormer: Enhanced Transformer with Rotary Position Embedding” and submit it to arXiv. In fact, the question of 2D RoPE originally came from the EleutherAI team as well.
Summary#
This article introduced our 2D generalization of RoPE, using “relativity” and “invertibility” as the guiding requirements to determine its final form. We explored two derivation routes, quaternions and the matrix exponential, and ultimately obtained the solution via the matrix exponential. Working through the derivation also deepens our understanding of RoPE.