
6. Completeness Analysis of Rotational Positional Encoding

Road to a better Transformer - This article is part of a series.
Part 6: This Article
This is a gemini-2.5-flash-preview-04-17 translation of a Chinese article. Beware of potential errors.

In last year’s article “The Path to Transformer Upgrade: 2. Rotational Positional Encoding, Drawing on Strengths”, I introduced Rotational Positional Encoding (RoPE). At the time, the starting point was simply that implementing relative positions by way of absolute positions seemed like a “fun thing to do”; I did not expect it to work quite well in practice and to be adopted so widely, which was a pleasant surprise. Later, in “The Path to Transformer Upgrade: 4. Rotational Positional Encoding for Two-Dimensional Positions”, I discussed the two-dimensional form of RoPE and studied the general solution of RoPE expressed via matrix exponentials.

Since a general solution is available, a question naturally arises: the RoPE we commonly use is merely a block-diagonal matrix whose basic units are two-dimensional rotation matrices. If we switched to the general solution, would the results be better, at least in theory? This article answers this question.

Exponential General Solution

In “The Path to Transformer Upgrade: 4. Rotational Positional Encoding for Two-Dimensional Positions”, we abstractly defined RoPE as any square matrix satisfying the following equation:

$$ \boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=\boldsymbol{\mathcal{R}}_{n-m} $$

Then, we explored the solution in the form of a matrix exponential:

$$ \boldsymbol{\mathcal{R}}_n=\exp n\boldsymbol{B} $$

The exponential here is not the element-wise exponential familiar from activation functions such as Softmax, but the matrix exponential defined by its Taylor series (see “Matrix Exponential”). According to the “Baker–Campbell–Hausdorff formula”, we have

$$ \begin{aligned}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n=&\,(\exp m\boldsymbol{B})^{\top}(\exp n\boldsymbol{B}) = (\exp m\boldsymbol{B}^{\top})(\exp n\boldsymbol{B}) \\=&\, \exp\left(m\boldsymbol{B}^{\top} + n\boldsymbol{B} + \frac{1}{2}mn\left[\boldsymbol{B}^{\top}, \boldsymbol{B}\right]+\cdots\right)\end{aligned} $$

Here $[\boldsymbol{A}, \boldsymbol{B}]=\boldsymbol{A}\boldsymbol{B}-\boldsymbol{B}\boldsymbol{A}$, and $\cdots$ omits terms of third order and higher in $m, n$. For the defining relation to hold, the exponent above must equal $(n-m)\boldsymbol{B}$ for all $m, n$, which forces

$$ \boldsymbol{B}^{\top} = - \boldsymbol{B} $$

That is, $\boldsymbol{B}$ must be a skew-symmetric matrix. Note that once $\boldsymbol{B}^{\top} = -\boldsymbol{B}$ holds, we also have $[\boldsymbol{B}^{\top}, \boldsymbol{B}] = [-\boldsymbol{B}, \boldsymbol{B}] = \boldsymbol{0}$, so all the omitted higher-order terms vanish and the exponent reduces exactly to $(n-m)\boldsymbol{B}$.
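As a quick numerical sanity check (a minimal sketch with numpy/scipy; the dimension and random seed here are arbitrary assumptions of the example), we can build a random skew-symmetric $\boldsymbol{B}$, compute $\boldsymbol{\mathcal{R}}_n = \exp n\boldsymbol{B}$ with the true matrix exponential, and verify the defining relation:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential (not the element-wise exp)

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))
B = A - A.T                     # skew-symmetric: B^T = -B

def R(n):
    """R_n = exp(n B)."""
    return expm(n * B)

m, n = 3, 7
# Defining relation: R_m^T R_n = R_{n-m}
print(np.allclose(R(m).T @ R(n), R(n - m)))   # True
```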

Orthogonal General Solution

Furthermore, we have $(\exp \boldsymbol{B})^{\top}(\exp \boldsymbol{B})=\exp(\boldsymbol{B}-\boldsymbol{B}) = \boldsymbol{I}$ and $\exp n\boldsymbol{B} = (\exp\boldsymbol{B})^n$. The former shows that $\exp\boldsymbol{B}$ is an orthogonal matrix, and the latter suggests asking whether the solution can be generalized to an arbitrary orthogonal matrix. It is not difficult to verify that the answer is yes, and we arrive at the following conclusion:

For any orthogonal matrix $\boldsymbol{O}$, $\boldsymbol{\mathcal{R}}_n=\boldsymbol{O}^n$ is a solution satisfying the equation.
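This is easy to check numerically as well (a small sketch; building a random orthogonal matrix from a QR decomposition is just one convenient choice, not something the conclusion depends on):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
# A random orthogonal matrix obtained from a QR decomposition.
O, _ = np.linalg.qr(rng.standard_normal((d, d)))

def R(n):
    """R_n = O^n."""
    return np.linalg.matrix_power(O, n)

m, n = 4, 9
print(np.allclose(R(m).T @ R(n), R(n - m)))   # True: O^{-m} O^n = O^{n-m}
```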

It is worth noting that, over the reals, not every orthogonal matrix can be written in the form $\exp\boldsymbol{B}$ with $\boldsymbol{B}$ skew-symmetric: since $\det(\exp\boldsymbol{B}) = \exp(\operatorname{Tr}\,\boldsymbol{B}) > 0$, any matrix of the form $\exp\boldsymbol{B}$ has positive determinant (for an orthogonal matrix, that means determinant equal to 1). In fact, the converse is also true: an orthogonal matrix with determinant 1 can always be written in the form $\exp\boldsymbol{B}$, where $\boldsymbol{B}$ is a skew-symmetric matrix. (Refer to “Appreciation of the Identity det(exp(A)) = exp(Tr(A))” and “Why can any orthogonal matrix be written as O=e^A”.)
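The identity $\det(\exp\boldsymbol{B}) = \exp(\operatorname{Tr}\,\boldsymbol{B})$ behind this argument is easy to confirm numerically for a skew-symmetric $\boldsymbol{B}$, whose trace is zero (another minimal sketch):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
B = A - A.T                                      # skew-symmetric => Tr(B) = 0

# det(exp B) = exp(Tr B) = e^0 = 1, so exp(B) has determinant +1.
print(np.isclose(np.linalg.det(expm(B)), 1.0))   # True
```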

For an orthogonal matrix with $\det(\boldsymbol{O}) = -1$, we can write $\boldsymbol{O}=\boldsymbol{O}_+ \boldsymbol{I}_-$, where $\boldsymbol{I}_-$ is an orthogonal matrix with determinant $-1$ (for instance, the identity matrix with one diagonal entry flipped to $-1$) and $\boldsymbol{O}_+$ is an orthogonal matrix with $\det(\boldsymbol{O}_+) = 1$, which can therefore be written as $\exp\boldsymbol{B}$. Choosing the decomposition so that $\boldsymbol{O}_+$ and $\boldsymbol{I}_-$ commute (this is always possible, e.g., by taking $\boldsymbol{I}_-$ to be the reflection that negates an eigenvector of $\boldsymbol{O}$ with eigenvalue $-1$), we get $\boldsymbol{O}^n = (\boldsymbol{O}_+ \boldsymbol{I}_-)^n = \boldsymbol{I}_-^n\exp n\boldsymbol{B}$. In other words, even when $\det(\boldsymbol{O}) = -1$, $\boldsymbol{O}^n$ is merely a simple transformation of $\exp n\boldsymbol{B}$. Therefore, we will primarily study solutions of the form $\exp n\boldsymbol{B}$.

Completeness Analysis

As is well known, the RoPE positional encoding we usually use is a block-diagonal matrix of the following form:

$$ \scriptsize{\left(\begin{array}{cc:cc:cc:cc}\cos n\theta_0 & -\sin n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\\sin n\theta_0 & \cos n\theta_0 & 0 & 0 & \cdots & \cdots & 0 & 0 \\\hdashline 0 & 0 & \cos n\theta_1 & -\sin n\theta_1 & \cdots & \cdots & 0 & 0 \\0 & 0 & \sin n\theta_1 & \cos n\theta_1 & \cdots & \cdots & 0 & 0 \\\hdashline \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\\hdashline 0 & 0 & 0 & 0 & \cdots & \cdots & \cos n\theta_{d/2-1} & -\sin n\theta_{d/2-1} \\0 & 0 & 0 & 0 & \cdots & \cdots & \sin n\theta_{d/2-1} & \cos n\theta_{d/2-1} \\\end{array}\right)} $$

It can be written more compactly as

$$ \begin{pmatrix}\boldsymbol{R}_{n\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\\boldsymbol{0} & \boldsymbol{R}_{n\theta_1} & \cdots & \boldsymbol{0} \\\vdots & \vdots & \ddots & \vdots \\\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{R}_{n\theta_{d/2-1}} \\\end{pmatrix} = \exp n\begin{pmatrix}\boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\\boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\\vdots & \vdots & \ddots & \vdots \\\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\\end{pmatrix} $$

where

$$ \boldsymbol{R}_{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix},\quad \boldsymbol{J}_{\theta} = \begin{pmatrix} 0 & -\theta \\ \theta & 0\end{pmatrix} $$
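The identity above can be confirmed numerically too. Below is a small sketch; the frequencies $\theta_i = 10000^{-2i/d}$ follow the usual RoPE schedule, which is an assumption of the example rather than something the identity depends on:

```python
import numpy as np
from scipy.linalg import expm, block_diag

d = 8
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)   # usual RoPE frequency schedule

# Generator: block-diagonal skew-symmetric matrix built from J_theta blocks.
J = block_diag(*[np.array([[0.0, -t], [t, 0.0]]) for t in thetas])

def rope_matrix(n):
    """Block-diagonal matrix of 2x2 rotations R_{n*theta_i} (the usual RoPE matrix)."""
    return block_diag(*[np.array([[np.cos(n * t), -np.sin(n * t)],
                                  [np.sin(n * t),  np.cos(n * t)]]) for t in thetas])

n = 5
print(np.allclose(expm(n * J), rope_matrix(n)))   # True: exp(nJ) is exactly the usual RoPE matrix
```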

This choice is arguably the simplest one, primarily because it keeps the computation cheap. The question of completeness is then: does the block-diagonal special case shown above lose any capability compared to the fully parameterized $\exp n\boldsymbol{B}$? In other words, disregarding computational cost, could replacing $\boldsymbol{B}$ with a general skew-symmetric matrix possibly improve the results?

Answering this question is not difficult. In fact, any even-order skew-symmetric matrix can be brought, via a similarity transformation, into the block-diagonal form

$$ \boldsymbol{\Lambda} = \begin{pmatrix}\boldsymbol{J}_{\theta_0} & \boldsymbol{0} & \cdots & \boldsymbol{0} \\\boldsymbol{0} & \boldsymbol{J}_{\theta_1} & \cdots & \boldsymbol{0} \\\vdots & \vdots & \ddots & \vdots \\\boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{J}_{\theta_{d/2-1}} \\\end{pmatrix} $$

This conclusion can be found under “Skew-symmetric matrix”. This means there exists an invertible matrix $\boldsymbol{P}$ such that $\boldsymbol{B}=\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, and thus

$$ \exp n\boldsymbol{B} = \exp \left(n\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}\right) = \boldsymbol{P}(\exp n\boldsymbol{\Lambda})\boldsymbol{P}^{-1} $$

In other words, any $\exp n\boldsymbol{B}$ and the block-diagonal $\exp n\boldsymbol{\Lambda}$ only differ by a similarity transformation. When applying RoPE in Self Attention, we have

$$ \boldsymbol{q}^{\top}(\exp (n-m)\boldsymbol{B})\boldsymbol{k} = (\boldsymbol{P}^{\top}\boldsymbol{q})^{\top}(\exp (n-m)\boldsymbol{\Lambda})(\boldsymbol{P}^{-1}\boldsymbol{k}) $$

Since $\boldsymbol{q}, \boldsymbol{k}$ are generally obtained from the input $\boldsymbol{x}$ through learnable linear transformations, $\boldsymbol{P}^{\top}$ and $\boldsymbol{P}^{-1}$ can, in principle, be absorbed into the trainable parameters of those transformations. Therefore, working directly with $\boldsymbol{q}^{\top}(\exp (n-m)\boldsymbol{\Lambda})\boldsymbol{k}$ loses no generality in theory.
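The similarity-transform argument can be checked end to end with a short numerical sketch. Using the real Schur decomposition to obtain the block-diagonal $\boldsymbol{\Lambda}$ is a choice of this example (it works because a skew-symmetric matrix is normal, so its real Schur form is exactly block-diagonal):

```python
import numpy as np
from scipy.linalg import expm, schur

rng = np.random.default_rng(3)
d = 6
A = rng.standard_normal((d, d))
B = A - A.T                          # a fully general skew-symmetric generator

# Real Schur form: B = P @ Lam @ P^{-1}, with P orthogonal and Lam block-diagonal
# (2x2 blocks of the J_theta type).
Lam, P = schur(B, output='real')

q = rng.standard_normal(d)
k = rng.standard_normal(d)
m, n = 2, 9

lhs = q @ expm((n - m) * B) @ k
rhs = (P.T @ q) @ expm((n - m) * Lam) @ (np.linalg.inv(P) @ k)
print(np.isclose(lhs, rhs))          # True: only a similarity transform separates the two
```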

So, for Self Attention, the answer to the question is no. However, for Linear Attention, the answer is slightly different because Linear Attention applies an activation function to $\boldsymbol{q}, \boldsymbol{k}$:

$$ \phi(\boldsymbol{q})^{\top}(\exp (n-m)\boldsymbol{B})\varphi(\boldsymbol{k}) = (\boldsymbol{P}^{\top}\phi(\boldsymbol{q}))^{\top}(\exp (n-m)\boldsymbol{\Lambda})(\boldsymbol{P}^{-1}\varphi(\boldsymbol{k})) $$

As a result, $\boldsymbol{P}^{\top}$ and $\boldsymbol{P}^{-1}$ can no longer necessarily be absorbed into the trainable parameters of the linear transformations. Therefore, adding two extra parameter matrices is potentially beneficial for Linear Attention.
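As a purely illustrative sketch of where such extra matrices would sit (the feature map $\phi(x) = \text{elu}(x) + 1$ and the names P1, P2 are assumptions of this example, not part of the original derivation), note that they act after the nonlinearity, which is exactly why they cannot be folded back into the q/k projections:

```python
import numpy as np
from scipy.linalg import expm, block_diag

def phi(x):
    # A common Linear Attention feature map, elu(x) + 1 (one possible choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

d = 6
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
Lam = block_diag(*[np.array([[0.0, -t], [t, 0.0]]) for t in thetas])

rng = np.random.default_rng(4)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Hypothetical extra trainable matrices applied AFTER the nonlinearity phi;
# because phi sits in between, they cannot be absorbed into the q/k projections.
P1 = rng.standard_normal((d, d))
P2 = rng.standard_normal((d, d))

m, n = 1, 6
score = (P1 @ phi(q)) @ expm((n - m) * Lam) @ (P2 @ phi(k))
print(score)
```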

Summary

This article briefly analyzed the completeness of RoPE, showing that for Self Attention the current block-diagonal RoPE loses no generality, while for Linear Attention adding two extra parameter matrices could be beneficial.

@online{kexuefm-9403,
        title={The Path to Transformer Upgrade: 6. Completeness Analysis of Rotational Positional Encoding},
        author={苏剑林},
        year={2022},
        month={12},
        url={\url{https://kexue.fm/archives/9403}},
}