Readers who have read my previous articles “Exploring Linear Attention: Does Attention Have to Have Softmax?” and “Performer: Linearizing Attention Complexity with Random Projections” might find the title of this article a bit backwards. Linear Attention came first and Performer came later, and the relationship between them is that “Performer is an implementation of linear Attention that approximates standard Attention while keeping the complexity linear.” So the natural order would be “From Linear Attention to Performer.”
However, this article is not intended to trace the development history of linear Attention, but rather to think in reverse about the insights Performer offers for linear Attention, hence “From Performer to Linear Attention.”
Activation Function#
The common form of linear Attention is
$$ Attention(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})_i = \frac{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)\boldsymbol{v}_j}{\sum\limits_{j=1}^n \text{sim}(\boldsymbol{q}_i, \boldsymbol{k}_j)} = \frac{\sum\limits_{j=1}^n \phi(\boldsymbol{q}_i)^{\top} \varphi(\boldsymbol{k}_j)\boldsymbol{v}_j}{\sum\limits_{j=1}^n \phi(\boldsymbol{q}_i)^{\top} \varphi(\boldsymbol{k}_j)} $$where $\phi(\cdot)$ and $\varphi(\cdot)$ are activation functions with non-negative ranges. How to choose these activation functions? Performer tells us that the exponential function should be chosen
$$ \phi(x)=\varphi(x)=e^x $$First, let’s see how this differs from existing results. The choice given in “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” is:
$$ \phi(x)=\varphi(x)=1 + \text{elu}(x) = \left\{\begin{aligned}1 + x,\, x \geq 0\\ e^x,\, x < 0\end{aligned}\right. $$We know that $1+x$ is the first-order Taylor expansion of $e^x$ at $x=0$. Therefore, the choice $1+\text{elu}(x)$ is actually quite close to $e^x$.
Furthermore, the scheme $\phi(x)=\varphi(x)=e^x$ is also very similar to the dual-softmax design of linear Attention introduced in “Efficient Attention: Attention with Linear Complexities”. There, $\phi(\boldsymbol{q})=\text{softmax}(\boldsymbol{q})$ and $\varphi(\boldsymbol{k})=e^{\boldsymbol{k}}$, which differs from using $\phi(x)=\varphi(x)=e^x$ directly only in where the normalization is placed.
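To make the linear-complexity claim concrete, here is a minimal NumPy sketch (an illustration under my own naming, not code from any of the papers above) of linear Attention with $\phi(x)=\varphi(x)=e^x$. The $n\times n$ attention matrix is never formed, and swapping in $1+\text{elu}(x)$ only changes the `phi` argument; the shapes and the helper names `linear_attention`, `standard_attention` are assumptions for this example.

```python
import numpy as np

def linear_attention(Q, K, V, phi=np.exp):
    # phi is applied elementwise; phi = np.exp is the e^x choice above, while
    # phi = lambda x: np.where(x >= 0, 1 + x, np.exp(x)) gives 1 + elu(x).
    # The n x n attention matrix is never formed, so the cost is O(n * d * d_v).
    Qp, Kp = phi(Q), phi(K)                 # (n, d), non-negative features
    KV = Kp.T @ V                           # (d, d_v): sum_j phi(k_j) v_j^T
    Z = Qp @ Kp.sum(axis=0)                 # (n,):    sum_j phi(q_i)^T phi(k_j)
    return (Qp @ KV) / Z[:, None]

def standard_attention(Q, K, V):
    A = np.exp(Q @ K.T)                     # explicit n x n matrix: quadratic in n
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d, dv = 128, 16, 32
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, dv))
out = linear_attention(Q, K, V)             # (n, dv)
```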
Simple Derivation#
Why do we say Performer tells us that the best choice for the activation function is $e^x$? Let’s look at the mapping Performer found to linearize standard Attention:
$$ \begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}}&=\mathbb{E}_{\boldsymbol{\omega}\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)}\left[e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\right]\\[6pt] &\approx\underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \end{pmatrix}}_{\tilde{\boldsymbol{q}}} \cdot \underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \end{pmatrix}}_{\tilde{\boldsymbol{k}}} \end{aligned} $$Simply put, Performer found a mapping that maps $d$-dimensional vectors $\boldsymbol{q},\boldsymbol{k}$ to $m$-dimensional vectors $\tilde{\boldsymbol{q}},\tilde{\boldsymbol{k}}$, satisfying the approximation $e^{\boldsymbol{q}\cdot \boldsymbol{k}}\approx \tilde{\boldsymbol{q}}\cdot\tilde{\boldsymbol{k}}$. In this case,
$$ a_{i,j} = \frac{e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}{\sum\limits_j e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}\approx \frac{\tilde{\boldsymbol{q}}_i\cdot\tilde{\boldsymbol{k}}_j}{\sum\limits_j \tilde{\boldsymbol{q}}_i\cdot\tilde{\boldsymbol{k}}_j} = \frac{(\lambda(\tilde{\boldsymbol{q}}_i)\tilde{\boldsymbol{q}}_i)\cdot\tilde{\boldsymbol{k}}_j}{\sum\limits_j (\lambda(\tilde{\boldsymbol{q}}_i)\tilde{\boldsymbol{q}}_i)\cdot\tilde{\boldsymbol{k}}_j} $$The last equality shows that multiplying $\tilde{\boldsymbol{q}}$ by a constant (even if this constant depends on $\tilde{\boldsymbol{q}}$) does not change the Performer result. This means that changing the mapping to
$$ \tilde{\boldsymbol{q}} = \begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}} \end{pmatrix},\qquad \tilde{\boldsymbol{k}}=\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \end{pmatrix} $$will not change the Performer result. Of course, the term $\Vert \boldsymbol{k}\Vert^2$ still cannot be removed here, but if we assume that $\Vert \boldsymbol{k}\Vert^2$ does not fluctuate too much and is not the main factor in Attention, then this term is also equivalent to a constant. Thus, the final mapping is (approximately) equivalent to
$$ \tilde{\boldsymbol{q}} = \begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}} \end{pmatrix},\qquad \tilde{\boldsymbol{k}}=\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}} \end{pmatrix} $$How should we understand this seemingly much simplified mapping? In fact, the $m$ random vectors $\boldsymbol{\omega}_1,\boldsymbol{\omega}_2,\cdots,\boldsymbol{\omega}_m$ form a $d\times m$ matrix, which maps $d$-dimensional $\boldsymbol{q},\boldsymbol{k}$ to $m$-dimensional vectors, and then applies the activation function $e^x$ to obtain $\tilde{\boldsymbol{q}},\tilde{\boldsymbol{k}}$. We know that Attention’s $\boldsymbol{q},\boldsymbol{k}$ have a dense layer transformation. If we integrate this $d\times m$ mapping matrix into the dense layer, what remains is just the activation function $e^x$!
So this is the origin of the optimal activation function $e^x$. As long as we change the output dimension of $\boldsymbol{q},\boldsymbol{k}$ from $d$ to $m$ and use the $e^x$ activation function, the model theoretically has at least the fitting capability of Performer, and possibly more: Performer’s $d\times m$ matrix is a fixed random matrix, whereas here we essentially make that matrix trainable and absorb it into the dense layer, removing the low-rank constraint and giving a larger space than Performer.
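As a sanity check on the random-feature identity above, the following NumPy sketch draws $m$ Gaussian vectors $\boldsymbol{\omega}_i$ and compares $e^{\boldsymbol{q}\cdot\boldsymbol{k}}$ with $\tilde{\boldsymbol{q}}\cdot\tilde{\boldsymbol{k}}$. The scales of $\boldsymbol{q},\boldsymbol{k}$ and the choice $m=4096$ are arbitrary illustrative assumptions; the Monte Carlo estimate has some variance, so the two numbers should be close but not identical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                              # m >> d gives a tighter approximation
q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))                  # rows are the omega_i ~ N(0, I_d)

q_tilde = np.exp(W @ q - (q @ q) / 2) / np.sqrt(m)
k_tilde = np.exp(W @ k - (k @ k) / 2) / np.sqrt(m)

print(np.exp(q @ k), q_tilde @ k_tilde)      # the two numbers should be close

# Multiplying q_tilde by any positive constant cancels after row normalization,
# which is why the e^{-||q||^2/2} factor can be dropped from the query side.
```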
Low-Rank Problem#
Whether it’s Performer, the protagonist of this article, or Nyströmformer introduced in “Nyströmformer: A Linear Attention Scheme Based on Matrix Decomposition”, their idea is to “find a linear Attention that can approximate standard Attention.” So a natural question is: What’s good about standard Attention? Why is it worth aligning with?
From the perspective of information loss, the “rank” of the standard Attention matrix might be larger, i.e., closer to an invertible matrix, which means it can retain more effective information. Specifically, the Attention matrix is an $n\times n$ matrix, derived from $\boldsymbol{Q},\boldsymbol{K}\in\mathbb{R}^{n\times d}$ through $softmax(\boldsymbol{Q}\boldsymbol{K}^{\top})$. Note that here $d$ is the Attention key_size, e.g., for BERT base, it is only 64, while $n$ is often much larger. This indicates that the rank of $\boldsymbol{Q}\boldsymbol{K}^{\top}$ does not exceed $d$, and $d\ll n$, meaning it is far from full rank. However, the key operation of $softmax$ is $e^{\boldsymbol{Q}\boldsymbol{K}^{\top}}$. If each element of a matrix is exponentiated, the rank of the new matrix can potentially increase! So the standard Attention matrix has the possibility of rank increase, meaning it possesses the ability to process information more effectively.
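A quick numerical check of this rank-increase claim (a toy example with arbitrary sizes, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 4
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

logits = Q @ K.T                                  # rank at most d = 4
print(np.linalg.matrix_rank(logits))              # 4
print(np.linalg.matrix_rank(np.exp(logits)))      # typically far larger than d, often close to n
```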
In contrast, the linear Attention matrix is in the form of $\tilde{\boldsymbol{Q}}\tilde{\boldsymbol{K}}^{\top}$, so the rank of the linear Attention matrix is certainly no more than $m$. To compensate for the loss of rank, it is generally necessary to set $m > d$. In Performer’s experiments, they chose $m = 4d$, which means the key_size is increased by 4 times. The importance of rank is evident. Of course, increasing the key_size has the direct consequence that linear Attention is still slower than standard Attention when processing short sequences, which is an inherent bottleneck of linear Attention.
Regarding theoretical analysis of the rank of the Attention matrix, some papers can be referenced, such as “Low-Rank Bottleneck in Multi-head Attention Models”, which points out that even in standard Attention, low-rank is a serious bottleneck, and increasing key_size can improve performance. Last month’s “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth” points out that without residuals and FFNs, standard Attention has a significant risk of degrading to a simple transformation with rank 1. Even standard Attention, a model with “rank-increasing potential,” has low-rank issues, not to mention linear Attention, a model whose rank has an inherent upper bound.
So, in a nutshell: using linear Attention requires a larger key_size to maintain the rank of the matrix.
Concentrated Attention#
We can also understand the benefits of standard Attention from the perspective of sparsity. Intuitively, since it is an “attention mechanism,” it must “concentrate attention.” If it is too dispersed, it might be equivalent to average pooling. “Concentrating attention” means that each token should only be significantly related to a few tokens. Mathematically, this means the Attention matrix is sparse, or at least has the potential to become sparse.
For standard Attention, it is normalized by softmax
$$ a_{i,j} = \frac{e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}{\sum\limits_j e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}} $$where the exponential function $e^x$ acts as an amplifier. As long as the $\boldsymbol{q}_i\cdot \boldsymbol{k}_j$ values themselves can be sufficiently separated, $e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}$ will further amplify this separation. The result is that after normalization, except for the positions with the largest values, the remaining probabilities are very close to 0. This shows that standard Attention has the potential to “concentrate attention.” For linear Attention, it is the direct result of the dot product, without the further amplification of $e^x$, so its attention is relatively dense. When the sequence length is large, it is often very close to average pooling. To mitigate this, increasing the key_size is still necessary to amplify the differences. Intuitively, $n$ vectors are too “crowded” in a low-dimensional space, and they become “looser” when moved to a higher-dimensional space.
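The sketch below compares how concentrated the two kinds of attention rows are, using average row entropy as a rough proxy for sparsity ($\log n$ corresponds to uniform weights, i.e. average pooling). With random $\boldsymbol{Q},\boldsymbol{K}$ the softmax rows typically come out far more concentrated than the linear Attention rows, which sit close to $\log n$; in a trained model the separation of the $\boldsymbol{q}_i\cdot\boldsymbol{k}_j$ values, and hence the amplification by $e^x$, is even stronger. The helper `row_entropy` is just for this illustration.

```python
import numpy as np

def row_entropy(A):
    # average entropy of the attention rows; log(n) = uniform (average pooling)
    return -(A * np.log(A + 1e-12)).sum(axis=1).mean()

rng = np.random.default_rng(0)
n, d = 256, 32
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

soft = np.exp(Q @ K.T)                              # standard Attention: e^x amplifies differences
soft /= soft.sum(axis=1, keepdims=True)
lin = np.exp(Q) @ np.exp(K).T                       # linear Attention: plain dot product of features
lin /= lin.sum(axis=1, keepdims=True)

print(row_entropy(soft), row_entropy(lin), np.log(n))
```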
How to verify the importance of sparsity? I once tried calculating the Attention matrix for linear Attention first, and then forcefully truncating the Attention matrix (i.e., each token only attends to a few tokens around it, making it a local form of Attention) to make it sparse. The result was that this truncated linear Attention performed significantly better than the full-matrix linear Attention. This confirmed the importance of sparsity. Of course, calculating the Attention matrix first and then truncating it this way makes the complexity of linear Attention no longer linear, so it is not practical and is only used for theoretical verification.
Another experimental phenomenon can help prove the importance of sparsity: when linear Attention is used for language models or decoders, the performance is almost the same as standard Attention. In this case, linear Attention becomes a unidirectional RNN (refer to “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention”), equivalent to the Attention matrix becoming a lower triangular matrix, which is also sparser. In contrast, if a non-sparse bidirectional linear Attention is used directly for an MLM model, the performance drop is quite noticeable.
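For reference, this is roughly what the unidirectional (decoder) form looks like: causal linear Attention reduces to a recurrence over two running sums, which is the RNN view from “Transformers are RNNs”. The sketch below uses a plain NumPy loop for clarity rather than an efficient implementation; `phi` plays the role of $\phi=\varphi$.

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi=np.exp):
    # RNN form: carry S_t = sum_{j<=t} phi(k_j) v_j^T and z_t = sum_{j<=t} phi(k_j),
    # which is equivalent to masking the linear Attention matrix to lower-triangular.
    n, d = Q.shape
    dv = V.shape[1]
    S = np.zeros((d, dv))
    z = np.zeros(d)
    out = np.empty((n, dv))
    for t in range(n):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z)
    return out
```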
More importantly, sparsity is closely related to the rank discussed in the previous section; they can even be considered “two sides of the same coin”: appropriate sparsification can increase the rank of the matrix! For example, for the lower triangular Attention matrix used in language models, as long as the diagonal elements are non-zero (which they usually are), the matrix is immediately full rank and invertible! Likewise, the local Attention truncation I experimented with also increases the matrix rank. In the extreme case where each token only attends to itself, the Attention matrix becomes the full-rank identity matrix!
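A small check of this “sparsification raises rank” point, under the same kind of toy setup as before (illustrative sizes only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

A = np.exp(Q) @ np.exp(K).T                 # (unnormalized) linear Attention matrix, rank <= d
print(np.linalg.matrix_rank(A))             # 8
print(np.linalg.matrix_rank(np.tril(A)))    # typically 64: the positive diagonal makes the
                                            # lower-triangular (causal) matrix full rank
```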
Summary#
This article started from Performer and discussed some issues with linear Attention, including the choice of activation function for linear Attention and the bottlenecks of linear Attention (low rank, sparsity). The overall conclusion is that the optimal activation function for linear Attention should be the exponential function, and an effective Attention mechanism should have higher rank and greater sparsity.
@online{kexuefm-8338,
title={The Path to Transformer Upgrade: 3. From Performer to Linear Attention},
author={苏剑林},
year={2021},
month={04},
url={\url{https://kexue.fm/archives/8338}},
}