In the article 《Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices》, we introduced a new optimizer named “Muon.” One perspective to understand it is as steepest gradient descent under spectral norm regularization, which seems to reveal a more fundamental optimization direction for matrix parameters. As is well known, for matrix parameters, we often also apply Weight Decay, which can be understood as the gradient of the squared Frobenius norm ($F$-norm). So, from Muon’s perspective, would constructing a new weight decay using the gradient of the squared spectral norm yield better performance?
The question then arises: What does the gradient, or derivative, of the spectral norm look like? And what kind of new weight decay can be designed using it? Next, we will elaborate on these questions.
Basic Review#
Spectral Norm, also known as the “$2$-norm,” is one of the most commonly used matrix norms. Compared to the simpler Frobenius Norm ($F$-norm), it often reveals more fundamental signals related to matrix multiplication. This is because its definition is inherently linked to matrix multiplication: For a matrix parameter $\boldsymbol{W}\in\mathbb{R}^{n\times m}$, its spectral norm is defined as
$$ \Vert\boldsymbol{W}\Vert_2 \triangleq \max_{\Vert\boldsymbol{x}\Vert=1} \Vert\boldsymbol{W}\boldsymbol{x}\Vert $$Here, $\boldsymbol{x}\in\mathbb{R}^m$ is a column vector, and $\Vert\cdot\Vert$ on the right-hand side denotes the vector magnitude (Euclidean norm). From another perspective, the spectral norm is the smallest constant $C$ such that the following inequality holds for all $\boldsymbol{x}\in\mathbb{R}^m$:
$$ \Vert\boldsymbol{W}\boldsymbol{x}\Vert \leq C\Vert\boldsymbol{x}\Vert $$It is not difficult to prove that the above inequality also holds when $C$ is taken to be the $F$-norm $\Vert \boldsymbol{W}\Vert_F$. Therefore, we can write $\Vert \boldsymbol{W}\Vert_2\leq \Vert \boldsymbol{W}\Vert_F$ (because $\Vert \boldsymbol{W}\Vert_F$ is merely one value of $C$ for which the inequality holds, while $\Vert \boldsymbol{W}\Vert_2$ is the smallest such $C$). This also indicates that if we want to control the magnitude of the output, the spectral norm is a more precise regularization term than the $F$-norm.
As early as 6 years ago, in 《Lipschitz Constraints in Deep Learning: Generalization and Generative Models》, we discussed the spectral norm. At that time, there were two application scenarios: First, WGAN explicitly imposed Lipschitz constraints on the discriminator, and one implementation method was based on spectral norm normalization. Second, some works showed that spectral norm as a regularization term achieved better performance compared to $F$-norm regularization.
Gradient Derivation#
Now let’s get to the main topic and attempt to derive the gradient of the spectral norm, $\nabla_{\boldsymbol{W}} \Vert\boldsymbol{W}\Vert_2$. We know that the spectral norm numerically equals its largest singular value. We have previously proven this in the “Matrix Norms” section of 《The Road to Low-Rank Approximation (II): SVD》. This means that if $\boldsymbol{W}$ can be SVD-decomposed as $\sum\limits_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$, then
$$ \Vert\boldsymbol{W}\Vert_2 = \sigma_1 = \boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 $$where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(n,m)} \geq 0$ are the singular values of $\boldsymbol{W}$. Differentiating both sides, we get
$$ d\Vert\boldsymbol{W}\Vert_2 = d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}d\boldsymbol{W}\boldsymbol{v}_1 + \boldsymbol{u}_1^{\top}\boldsymbol{W}d\boldsymbol{v}_1 $$Note that
$$ d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1 = d\boldsymbol{u}_1^{\top}\sum_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}\boldsymbol{v}_1 = d\boldsymbol{u}_1^{\top}\sigma_1 \boldsymbol{u}_1 = \frac{1}{2}\sigma_1 d(\Vert\boldsymbol{u}_1\Vert^2)=0 $$Similarly, $\boldsymbol{u}_1^{\top}\boldsymbol{W}d\boldsymbol{v}_1=0$, so
$$ d\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1^{\top}d\boldsymbol{W}\boldsymbol{v}_1 = \text{Tr}((\boldsymbol{u}_1 \boldsymbol{v}_1^{\top})^{\top} d\boldsymbol{W}) \quad\Rightarrow\quad \nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} $$Note that a crucial condition for this proof is $\sigma_1 > \sigma_2$. Because if $\sigma_1=\sigma_2$, $\Vert\boldsymbol{W}\Vert_2$ can be expressed as both $\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1$ and $\boldsymbol{u}_2^{\top}\boldsymbol{W}\boldsymbol{v}_2$. The gradients derived using the same method would be $\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ and $\boldsymbol{u}_2 \boldsymbol{v}_2^{\top}$ respectively, and the non-unique result implies that the gradient does not exist. Of course, from a practical perspective, the probability of two numbers being exactly equal is very small, so this point can be ignored.
(Note: The proof process here refers to a reply on Stack Exchange, but that reply did not prove $d\boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1=0$ and $\boldsymbol{u}_1^{\top}\boldsymbol{W}d\boldsymbol{v}_1=0$, which the author has supplemented.)
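To make the result concrete, here is a small numerical sanity check (an illustrative NumPy sketch, not code from the article): the analytic gradient $\boldsymbol{u}_1\boldsymbol{v}_1^{\top}$ is compared against central finite differences of $\Vert\boldsymbol{W}\Vert_2$.

```python
import numpy as np

# Check numerically that the gradient of the spectral norm is u1 v1^T.
rng = np.random.default_rng(0)
n, m = 5, 4
W = rng.standard_normal((n, m))

U, S, Vt = np.linalg.svd(W)
analytic = np.outer(U[:, 0], Vt[0])  # u1 v1^T (the SVD sign ambiguity cancels in the product)

# Central finite differences of ||W||_2 with respect to each entry
eps = 1e-5
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (np.linalg.norm(W + E, 2) - np.linalg.norm(W - E, 2)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # close to zero
```

For a generic random matrix, $\sigma_1 > \sigma_2$ holds with probability 1, so the gradient is well defined and the two results agree to finite-difference accuracy.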
Weight Decay#
Based on this result and the chain rule, we have
$$ \nabla_{\boldsymbol{W}}\left(\frac{1}{2}\Vert\boldsymbol{W}\Vert_2^2\right) = \Vert\boldsymbol{W}\Vert_2\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2 = \sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} $$Comparing with the result for the $F$-norm:
$$ \nabla_{\boldsymbol{W}}\left(\frac{1}{2}\Vert\boldsymbol{W}\Vert_F^2\right) = \boldsymbol{W} = \sum_{i=1}^{\min(n,m)}\sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} $$This comparison makes it very clear: The weight decay derived from the squared $F$-norm as a regularization term penalizes all singular values simultaneously. In contrast, the weight decay corresponding to the squared spectral norm only penalizes the largest singular value. If our goal is to compress the output magnitude, then compressing the largest singular value is “just right”; while compressing all singular values might achieve a similar goal, it could also potentially reduce the expressive power of the parameters.
According to the “Eckart-Young-Mirsky Theorem,” the rightmost expression of this equation has another meaning: it is the “optimal rank-1 approximation” of the matrix $\boldsymbol{W}$. That is to say, spectral norm weight decay changes each step from subtracting $\boldsymbol{W}$ itself to subtracting its optimal rank-1 approximation, which weakens the penalty strength and, to some extent, makes the penalty more targeted at the essence of the problem.
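The contrast can be checked numerically. Below is an illustrative NumPy sketch (not code from the article) that builds a matrix with known singular values $4 > 2 > 1$ and applies one step of each decay:

```python
import numpy as np

# F-norm decay shrinks every singular value by the same factor;
# spectral-norm decay shrinks only the largest one.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
sigmas = np.array([4.0, 2.0, 1.0])
W = U @ np.diag(sigmas) @ V.T                     # SVD by construction

rank1 = sigmas[0] * np.outer(U[:, 0], V[:, 0])    # sigma1 u1 v1^T

lam = 0.1
print(np.linalg.svd(W - lam * W, compute_uv=False))      # [3.6, 1.8, 0.9]
print(np.linalg.svd(W - lam * rank1, compute_uv=False))  # [3.6, 2.0, 1.0]
```

Every singular value is scaled by $(1-\lambda)$ under $F$-norm decay, while spectral decay leaves $\sigma_2,\sigma_3$ untouched.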
Numerical Computation#
For practical application, the most crucial question arises: how to compute $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$? SVD is certainly the simplest and most direct approach, but its computational complexity is undoubtedly the highest. We must find a more efficient computation method.
Without loss of generality, assume $n\geq m$. First, note that
$$ \sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top} = \sum_{i=1}^m\sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} \boldsymbol{v}_1 \boldsymbol{v}_1^{\top} = \boldsymbol{W}\boldsymbol{v}_1 \boldsymbol{v}_1^{\top} $$Thus, to compute $\sigma_1 \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$, we only need to know $\boldsymbol{v}_1$. Then, according to our discussion in the “Singular Value Decomposition” section of 《The Road to Low-Rank Approximation (II): SVD》, $\boldsymbol{v}_1$ is actually the eigenvector corresponding to the largest eigenvalue of the matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$. This way, we have transformed the problem from SVD of a general matrix $\boldsymbol{W}$ to eigenvalue decomposition of a real symmetric matrix $\boldsymbol{W}^{\top}\boldsymbol{W}$. This has already reduced complexity, as eigenvalue decomposition is usually significantly faster than SVD.
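As a sketch (NumPy, with an assumed random setup, not code from the article), the eigendecomposition route looks like this:

```python
import numpy as np

# sigma1 u1 v1^T without a full SVD: v1 is the eigenvector for the largest
# eigenvalue of the real symmetric matrix W^T W, and sigma1 u1 v1^T = W v1 v1^T.
rng = np.random.default_rng(2)
W = rng.standard_normal((8, 5))

eigvals, eigvecs = np.linalg.eigh(W.T @ W)  # eigenvalues in ascending order
v1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
rank1 = W @ np.outer(v1, v1)                # equals sigma1 u1 v1^T

# Cross-check against a full SVD
U, S, Vt = np.linalg.svd(W, full_matrices=False)
print(np.max(np.abs(rank1 - S[0] * np.outer(U[:, 0], Vt[0]))))  # ~0
```

Note that the sign ambiguity of $\boldsymbol{v}_1$ is irrelevant here, since only the product $\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ is used.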
If it is still too slow, then we need to introduce the principle behind many eigenvalue decomposition algorithms—"Power Iteration":
$$ \boldsymbol{x}_{t+1} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}_t\Vert} $$When $\sigma_1 > \sigma_2$, the iteration converges to $\boldsymbol{v}_1$ at a rate of $(\sigma_2/\sigma_1)^{2t}$.
Each step of power iteration requires only two “matrix-vector” multiplications, with a complexity of $\mathcal{O}(nm)$; the total complexity for $t$ iterations is $\mathcal{O}(tnm)$, which is very efficient. The drawback is that convergence can be slow when $\sigma_1$ and $\sigma_2$ are close. However, the actual performance of power iteration is often better than the theory suggests, and many early works achieved good results with only one iteration. This is because when $\sigma_1$ and $\sigma_2$ are close, they and their singular vectors are to some extent interchangeable: even if power iteration has not fully converged, it yields a mixture of the corresponding singular vectors, which is entirely sufficient.
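Power iteration is only a few lines of NumPy. The sketch below (function name and setup are illustrative, not from the article) builds a matrix with singular values $3 > 2 > 1 > 0.5$ so the result is easy to verify:

```python
import numpy as np

def top_right_singular_vector(W, num_iters=20, rng=None):
    """Approximate v1 by power iteration on W^T W (illustrative helper)."""
    rng = rng if rng is not None else np.random.default_rng()
    x = rng.standard_normal(W.shape[1])
    for _ in range(num_iters):
        x = W.T @ (W @ x)          # two matrix-vector products, O(nm) each
        x /= np.linalg.norm(x)
    return x

# Build W with known singular values 3 > 2 > 1 > 0.5
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
W = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T

v1 = top_right_singular_vector(W, num_iters=20, rng=rng)
print(abs(v1 @ V[:, 0]))       # ~1: aligned with the true v1 (up to sign)
print(np.linalg.norm(W @ v1))  # ~3: recovers sigma1
```

With $\sigma_2/\sigma_1 = 2/3$, the residual after 20 iterations is on the order of $(2/3)^{40}$, i.e. already negligible.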
Proof of Iteration#
In this section, we will complete the proof of power iteration. It is not difficult to see that power iteration can be equivalently written as
$$ \lim_{t\to\infty} \frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0\Vert} = \boldsymbol{v}_1 $$To prove this limit, we start from $\boldsymbol{W}=\sum\limits_{i=1}^m\sigma_i \boldsymbol{u}_i\boldsymbol{v}_i^{\top}$, and substitute to compute
$$ \boldsymbol{W}^{\top}\boldsymbol{W} = \sum_{i=1}^m\sigma_i^2 \boldsymbol{v}_i\boldsymbol{v}_i^{\top},\qquad(\boldsymbol{W}^{\top}\boldsymbol{W})^t = \sum_{i=1}^m\sigma_i^{2t} \boldsymbol{v}_i\boldsymbol{v}_i^{\top} $$Since $\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_m$ form an orthonormal basis for $\mathbb{R}^m$, $\boldsymbol{x}_0$ can be written as $\sum\limits_{j=1}^m c_j \boldsymbol{v}_j$. Thus, we have
$$ (\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0 = \sum_{i=1}^m\sigma_i^{2t} \boldsymbol{v}_i\boldsymbol{v}_i^{\top}\sum_{j=1}^m c_j \boldsymbol{v}_j = \sum_{i=1}^m\sum_{j=1}^m c_j\sigma_i^{2t} \boldsymbol{v}_i\underbrace{\boldsymbol{v}_i^{\top} \boldsymbol{v}_j}_{=\delta_{i,j}} = \sum_{i=1}^m c_i\sigma_i^{2t} \boldsymbol{v}_i $$And
$$ \Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0\Vert = \left\Vert \sum_{i=1}^m c_i\sigma_i^{2t} \boldsymbol{v}_i\right\Vert = \sqrt{\sum_{i=1}^m c_i^2\sigma_i^{4t}} $$Due to random initialization, the probability of $c_1=0$ is very small, so we can assume $c_1\neq 0$. Then
$$ \frac{(\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0}{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^t \boldsymbol{x}_0\Vert} = \frac{\sum\limits_{i=1}^m c_i\sigma_i^{2t} \boldsymbol{v}_i}{\sqrt{\sum\limits_{i=1}^m c_i^2\sigma_i^{4t}}} = \frac{\boldsymbol{v}_1 + \sum\limits_{i=2}^m (c_i/c_1)(\sigma_i/\sigma_1)^{2t} \boldsymbol{v}_i}{\sqrt{1 + \sum\limits_{i=2}^m (c_i/c_1)^2(\sigma_i/\sigma_1)^{4t}}} $$When $\sigma_1 > \sigma_2$, every $\sigma_i/\sigma_1$ with $i\geq 2$ is less than 1. Therefore, as $t\to \infty$, the corresponding terms vanish, and the limit is $\boldsymbol{v}_1$ (assuming $c_1 > 0$; if $c_1 < 0$ the limit is $-\boldsymbol{v}_1$, which serves equally well, since only $\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$ is ultimately needed).
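The $(\sigma_2/\sigma_1)^{2t}$ rate can also be observed empirically. The following NumPy sketch (assumed setup, not from the article) constructs a matrix with singular values $2 > 1.5 > 1$ and tracks the error of the iterate:

```python
import numpy as np

# Empirical check of the (sigma2/sigma1)^(2t) convergence rate.
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
W = U @ np.diag([2.0, 1.5, 1.0]) @ V.T

x = rng.standard_normal(3)
x /= np.linalg.norm(x)
errors = []
for t in range(12):
    x = W.T @ (W @ x)
    x /= np.linalg.norm(x)
    s = np.sign(x @ V[:, 0])               # fix the sign ambiguity of v1
    errors.append(np.linalg.norm(x - s * V[:, 0]))

# Successive error ratios should approach (sigma2/sigma1)^2 = (1.5/2)^2 = 0.5625
print(errors[10] / errors[9])
```

Each application of $\boldsymbol{W}^{\top}\boldsymbol{W}$ shrinks the $\boldsymbol{v}_2$ component by $(\sigma_2/\sigma_1)^2$, which is exactly the ratio the error exhibits once the faster-decaying $\boldsymbol{v}_3$ component has died out.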
Related Work#
The earliest paper to propose spectral norm regularization is probably 《Spectral Norm Regularization for Improving the Generalizability of Deep Learning》 from 2017. It compared weight decay, adversarial training, and spectral norm regularization, and found that spectral norm regularization generalized best.
That paper did not derive $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2^2 = 2\sigma_1\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ as this article does. Instead, it estimated $\Vert\boldsymbol{W}\Vert_2$ directly via power iteration, then added $\Vert\boldsymbol{W}\Vert_2^2$ to the loss function with a weighting coefficient, letting the optimizer compute the gradient itself. This approach is slightly less efficient and makes it difficult to decouple the regularization from the optimizer in the form of weight decay. The approach presented in this article is more flexible: like AdamW, it treats weight decay independently of the optimization of the main loss function.
Of course, from today’s LLM perspective, the biggest problem with those early experiments was their small scale, which makes them less convincing. However, given the precedent set by the Muon optimizer with the spectral norm, the author believes spectral norm weight decay is still worth revisiting and experimenting with. That said, whether for the $F$-norm or the spectral norm, these “generalization-oriented” techniques often involve an element of luck, so expectations should be kept modest.
My preliminary experimental results on language models show a possibly slight improvement in the loss (hopefully not a hallucination, and at worst, no degradation was observed). The experimental procedure involved using power iteration to estimate $\boldsymbol{v}_1$ (initialized as an all-ones vector, iterated 10 times), then changing the original weight decay from $-\lambda \boldsymbol{W}$ to $-\lambda \boldsymbol{W}\boldsymbol{v}_1\boldsymbol{v}_1^{\top}$, with the value of $\lambda$ remaining unchanged.
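A minimal sketch of this decay step (NumPy; the function name is illustrative, and this is a reconstruction from the description above, not the author's actual training code):

```python
import numpy as np

def spectral_weight_decay_step(W, lam, num_iters=10):
    """Decay W by lam * W v1 v1^T instead of lam * W.

    v1 is estimated by power iteration on W^T W, started from an
    all-ones vector and run for 10 iterations, as described above.
    """
    x = np.ones(W.shape[1])
    for _ in range(num_iters):
        x = W.T @ (W @ x)
        x /= np.linalg.norm(x)
    return W - lam * (W @ np.outer(x, x))   # W - lam * sigma1 u1 v1^T

rng = np.random.default_rng(5)
W = rng.standard_normal((16, 8))
W_new = spectral_weight_decay_step(W, lam=0.01)

s_old = np.linalg.svd(W, compute_uv=False)
s_new = np.linalg.svd(W_new, compute_uv=False)
print(s_old[0], s_new[0])  # the largest singular value shrinks
print(s_old[1], s_new[1])  # the rest are (approximately) untouched
```

In an actual optimizer, this rank-1 subtraction would replace the usual $-\lambda\boldsymbol{W}$ term in the parameter update, at an extra cost of $\mathcal{O}(tnm)$ per step for the power iteration.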
Summary#
This article derives the gradient of the spectral norm, thereby deriving a new weight decay, and shares the author’s reflections on it.
@online{kexuefm-10648,
title={Reflections on Novel Weight Decay from Spectral Norm Gradient},
author={苏剑林},
year={2024},
month={12},
url={\url{https://kexue.fm/archives/10648}},
}