
5. Linear Attention as Infinite Dimension


In 《Performer: Linearizing Attention Complexity with Random Projections》, we learned about the Performer model proposed by Google, which introduces a random projection scheme that can transform standard Attention into linear Attention while maintaining a certain level of approximation. Theoretically, as long as the projection dimension is large enough, it can approximate standard Attention sufficiently well. In other words, standard Attention can be viewed as infinite-dimensional linear Attention.

This article will introduce two other approaches I have conceived for transforming standard Attention into infinite-dimensional linear Attention. Unlike Performer’s random projection, these two approaches are deterministic, and they make it relatively easy to gauge how good the approximation is.

Brief Introduction
#

I won’t go into too much detail about standard Attention and linear Attention here. Readers who are not yet familiar with them can refer to my previous articles 《Exploring Linear Attention: Does Attention Necessarily Need a Softmax?》 and 《The Transformer Evolution: 3. From Performer to Linear Attention》. Simply put, the calculation method for standard Attention is

$$ a_{i,j}=\frac{e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}}{\sum\limits_j e^{\boldsymbol{q}_i\cdot \boldsymbol{k}_j}} $$

while the calculation method for linear Attention is

$$ a_{i,j}=\frac{\phi(\boldsymbol{q}_i)\cdot \varphi(\boldsymbol{k}_j)}{\sum\limits_j \phi(\boldsymbol{q}_i)\cdot \varphi(\boldsymbol{k}_j)} $$

Therefore, to (approximately) transform standard Attention into linear Attention, one generally needs to find transformations $\phi,\varphi$ such that

$$ \phi(\boldsymbol{q})\cdot \varphi(\boldsymbol{k})\approx e^{\boldsymbol{q}\cdot \boldsymbol{k}} $$

Here, $e^{\boldsymbol{q}\cdot \boldsymbol{k}}$ is the “kernel function” in kernel methods.

Random Projection
#

Performer found the first relatively practical random projection transformation scheme. Essentially, it is based on the following integral:

$$ \begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}} =&\, \frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}-\boldsymbol{q}-\boldsymbol{k}\Vert^2 / 2 + \boldsymbol{q}\cdot \boldsymbol{k}}d\boldsymbol{\omega}\\ =&\, \frac{1}{(2\pi)^{d/2}}\int e^{-\Vert\boldsymbol{\omega}\Vert^2 / 2}\times e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}d\boldsymbol{\omega} \\ \end{aligned} $$

which yields

$$ \begin{aligned} e^{\boldsymbol{q}\cdot \boldsymbol{k}}&=\mathbb{E}_{\boldsymbol{\omega}\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)}\left[e^{\boldsymbol{\omega}\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \times e^{\boldsymbol{\omega}\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\right]\\[6pt] &\approx\underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{q}-\Vert \boldsymbol{q}\Vert^2 / 2} \end{pmatrix}}_{\phi(\boldsymbol{q})} \cdot \underbrace{\frac{1}{\sqrt{m}}\begin{pmatrix}e^{\boldsymbol{\omega}_1\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \\ e^{\boldsymbol{\omega}_2\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2}\\ \vdots\\ e^{\boldsymbol{\omega}_m\cdot \boldsymbol{k}-\Vert \boldsymbol{k}\Vert^2 / 2} \end{pmatrix}}_{\varphi(\boldsymbol{k})} \end{aligned} $$

where $\boldsymbol{\omega}_1,\boldsymbol{\omega}_2,\cdots,\boldsymbol{\omega}_m\sim \mathcal{N}(\boldsymbol{\omega};0,\boldsymbol{1}_d)$. Thus, using the idea of random projection, we have approximately transformed the exponentiated dot product of two $d$-dimensional vectors into the dot product of two $m$-dimensional vectors. As $m\to\infty$, the two become equal in theory (by the law of large numbers).
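
To make this concrete, here is a minimal numpy sketch of the random-feature estimator above. It is only an illustration of the formula, not Performer’s actual implementation (which additionally uses orthogonal random features and other numerical stabilizations); the vector scale and sample count `m` are chosen arbitrarily.

```python
import numpy as np

def random_features(x, omegas):
    """Map x in R^d to m features so that phi(q) @ phi(k) estimates exp(q @ k)."""
    m = omegas.shape[0]
    return np.exp(omegas @ x - 0.5 * np.sum(x**2)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 100_000
q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)
omegas = rng.normal(size=(m, d))          # omega_1, ..., omega_m ~ N(0, I_d)

estimate = random_features(q, omegas) @ random_features(k, omegas)
print(estimate, np.exp(q @ k))            # the Monte Carlo estimate should be close to exp(q·k)
```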

The above random projection scheme is quite clever and not easy to come up with. Below, I will introduce two schemes I conceived, which are relatively easier to understand, especially for readers familiar with kernel functions, who might grasp them at a glance.

Taylor Expansion
#

My first idea is based on the Taylor expansion:

$$ e^{\boldsymbol{q}\cdot \boldsymbol{k}} = \sum_{m=0}^{\infty} \frac{(\boldsymbol{q}\cdot \boldsymbol{k})^m}{m!} $$

Truncating to the first $n+1$ terms, we get an $n$-degree polynomial in $\boldsymbol{q}\cdot \boldsymbol{k}$:

$$ e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx 1 + \boldsymbol{q}\cdot \boldsymbol{k} + \frac{1}{2}(\boldsymbol{q}\cdot \boldsymbol{k})^2 + \cdots + \frac{1}{n!}(\boldsymbol{q}\cdot \boldsymbol{k})^n $$

This is essentially a “polynomial kernel function”. Note that we have:

$$ \begin{aligned} (\boldsymbol{q}\cdot \boldsymbol{k})^m =&\, \left(\sum_i q_i k_i\right)^m = \left(\sum_{i_1} q_{i_1} k_{i_1}\right)\cdots\left(\sum_{i_m} q_{i_m} k_{i_m}\right) \\ =&\, \sum_{i_1,\cdots,i_m} (q_{i_1}\cdots q_{i_m}) (k_{i_1}\cdots k_{i_m}) \end{aligned} $$

If we view the collections of products $q_{i_1}\cdots q_{i_m}$ and $k_{i_1}\cdots k_{i_m}$, indexed over all $(i_1,\cdots,i_m)$, as two large $d^m$-dimensional vectors, then $(\boldsymbol{q}\cdot \boldsymbol{k})^m$ is the dot product of these two large vectors. In fact, the operation that builds such a “large vector” out of several vectors is called the “outer product”, also known as the “tensor product”, and is often denoted by $\otimes$. In this case,

$$ \frac{1}{m!}(\boldsymbol{q}\cdot \boldsymbol{k})^m = \frac{1}{m!}\underbrace{(\boldsymbol{q}\otimes\cdots\otimes\boldsymbol{q})}_{m\text{ copies of }\boldsymbol{q}}\cdot\underbrace{(\boldsymbol{k}\otimes\cdots\otimes\boldsymbol{k})}_{m\text{ copies of }\boldsymbol{k}} = \left(\frac{\otimes^m\boldsymbol{q}}{\sqrt{m!}}\right)\cdot\left(\frac{\otimes^m\boldsymbol{k}}{\sqrt{m!}}\right) $$

Here, $\otimes^m\boldsymbol{q}$ and $\otimes^m\boldsymbol{k}$ are shorthand for the $m$-th iterated outer product (outer product raised to the power of $m$) of $\boldsymbol{q}$ and $\boldsymbol{k}$ respectively. Using this result, we have

$$ e^{\boldsymbol{q}\cdot \boldsymbol{k}}\approx \sum_{m=0}^n \left(\frac{\otimes^m\boldsymbol{q}}{\sqrt{m!}}\right)\cdot\left(\frac{\otimes^m\boldsymbol{k}}{\sqrt{m!}}\right) =\underbrace{\begin{pmatrix} 1 \\ \boldsymbol{q}\\ \frac{\otimes^2\boldsymbol{q}}{\sqrt{2}} \\ \vdots\\ \frac{\otimes^n\boldsymbol{q}}{\sqrt{n!}}\end{pmatrix}}_{\phi(\boldsymbol{q})} \cdot \underbrace{\begin{pmatrix} 1 \\ \boldsymbol{k}\\ \frac{\otimes^2\boldsymbol{k}}{\sqrt{2}} \\ \vdots\\ \frac{\otimes^n\boldsymbol{k}}{\sqrt{n!}}\end{pmatrix}}_{\varphi(\boldsymbol{k})} $$

This completes the transformation from standard Attention to linear Attention.
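
As a sanity check, the feature map above can be materialized for small $d$ and $n$ with iterated Kronecker products (the flattened tensor product), and its inner product recovers the truncated Taylor series exactly. This is only an illustrative sketch; the name `taylor_features` is mine, not from the original article.

```python
import numpy as np
from math import factorial

def taylor_features(x, n):
    """Concatenate the flattened tensor powers (⊗^m x) / sqrt(m!) for m = 0..n."""
    feats, power = [np.ones(1)], np.ones(1)
    for m in range(1, n + 1):
        power = np.kron(power, x)                  # flattened ⊗^m x, length d**m
        feats.append(power / np.sqrt(factorial(m)))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
d, n = 4, 6
q, k = rng.normal(size=d), rng.normal(size=d)

approx = taylor_features(q, n) @ taylor_features(k, n)
truncated = sum((q @ k) ** m / factorial(m) for m in range(n + 1))
print(approx, truncated, np.exp(q @ k))            # approx equals the truncated series, which approximates exp(q·k)
```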

Exponential Definition
#

Compared to Performer’s random projection, the above Taylor expansion-based approach is arguably easier to understand. However, there is an even simpler and more direct idea than Taylor expansion, which is to use the definition of the natural exponential:

$$ e^x = \lim_{n\to\infty} \left(1+\frac{x}{n}\right)^n $$

Therefore, by choosing a sufficiently large $n$, we have

$$ e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx \left(1+\frac{{\boldsymbol{q}\cdot \boldsymbol{k}}}{n}\right)^n = \left(\begin{pmatrix} 1 \\ \frac{\boldsymbol{q}}{\sqrt{n}}\end{pmatrix} \cdot \begin{pmatrix}1 \\ \frac{\boldsymbol{k}}{\sqrt{n}}\end{pmatrix}\right)^n $$

Combining this with the polynomial kernel function transformation result from the previous section, we have

$$ e^{\boldsymbol{q}\cdot \boldsymbol{k}} \approx \underbrace{\left(\otimes^n\begin{pmatrix} 1 \\ \frac{\boldsymbol{q}}{\sqrt{n}}\end{pmatrix}\right)}_{\phi(\boldsymbol{q})} \cdot \underbrace{\left(\otimes^n\begin{pmatrix}1 \\ \frac{\boldsymbol{k}}{\sqrt{n}}\end{pmatrix}\right)}_{\varphi(\boldsymbol{k})} $$

This might be the simplest and most direct scheme for transforming standard Attention into linear Attention.
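
The corresponding feature map is again easy to materialize for small $d$: build the augmented vector $(1, \boldsymbol{q}/\sqrt{n})$ and take its $n$-fold Kronecker power. The sketch below is illustrative only (the feature dimension $(d+1)^n$ explodes quickly, so $d$ and $n$ are kept tiny); the name `exp_limit_features` is mine.

```python
import numpy as np

def exp_limit_features(x, n):
    """Flattened n-fold tensor power of the augmented vector (1, x / sqrt(n))."""
    aug = np.concatenate(([1.0], x / np.sqrt(n)))
    feat = np.ones(1)
    for _ in range(n):
        feat = np.kron(feat, aug)                  # length grows to (d + 1) ** n
    return feat

rng = np.random.default_rng(0)
d, n = 2, 12
q, k = rng.normal(size=d), rng.normal(size=d)

approx = exp_limit_features(q, n) @ exp_limit_features(k, n)
print(approx, (1 + q @ k / n) ** n, np.exp(q @ k))  # approx equals (1 + q·k/n)^n, which tends to exp(q·k) as n grows
```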

Result Analysis
#

Regarding practical value, the latter two deterministic schemes are far inferior to Performer’s random projection scheme. This is because the output dimension of random projection can be controlled flexibly, while the output dimension of the two deterministic schemes is on the order of $d^n$, which is usually much larger than the sequence length itself. Therefore, using them for linear Attention is generally less efficient than standard Attention.
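
For a concrete sense of scale: with head_size $d=64$ and truncation order $n=3$, the Taylor scheme already produces features of dimension $1+64+64^2+64^3=266{,}305$, far beyond typical sequence lengths.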

However, theoretically, the latter two schemes provide simpler and more convenient ideas for equating standard Attention with infinite-dimensional linear Attention. This equivalence can often help us better understand the Attention mechanism, the most direct implication being the understanding of Attention’s rank.

Readers who have studied linear Attention should know that if linear Attention is used for bidirectional attention tasks (like MLM), the performance degradation can be significant. This is because for linear Attention, $\phi(\boldsymbol{Q}),\varphi(\boldsymbol{K})\in\mathbb{R}^{n\times d}$ (where $n$ is the sequence length and $d$ is the head_size of each head), and usually $n \gg d$. Therefore, the rank of the $n\times n$ Attention matrix $\phi(\boldsymbol{Q})\varphi(\boldsymbol{K})^{\top}$ is at most $d$. This is the low-rank problem of linear Attention, and the low rank limits its expressive power.
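
The low-rank claim is easy to verify numerically. The following sketch uses random matrices in place of learned features, which is enough to show the structural point:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 512, 64                               # n >> d, as in the text
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

# Linear attention: the score matrix is a product of two n x d matrices, so rank <= d.
linear_scores = Q @ K.T
print(np.linalg.matrix_rank(linear_scores))        # at most 64

# Standard attention: the elementwise exp breaks the rank-d structure, and the
# unnormalized score matrix is generically of much higher (often full) rank.
standard_scores = np.exp(Q @ K.T / np.sqrt(d))
print(np.linalg.matrix_rank(standard_scores))      # typically far above 64
```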

Conversely, the three transformations introduced earlier all tell us that standard Attention can be viewed as infinite-dimensional linear Attention. Therefore, the rank of standard Attention is theoretically not limited by $d$, which is why standard Attention with the same number of parameters often performs better than linear Attention. As mentioned in 《The Transformer Evolution: 3. From Performer to Linear Attention》, if standard Attention is to be replaced by linear Attention, $d$ also needs to be increased correspondingly to keep performance roughly comparable.

Summary
#

This article introduced three perspectives for viewing standard Attention as infinite-dimensional linear Attention. These different viewpoints allow us to connect standard Attention with linear Attention, providing a more comprehensive understanding of the Attention mechanism from multiple angles.

@online{kexuefm-8601,
        title={The Transformer Evolution: 5. Linear Attention as Infinite Dimension},
        author={苏剑林},
        year={2021},
        month={08},
        url={\url{https://kexue.fm/archives/8601}},
}