This is a gemini-2.5-flash
translation of a Chinese article.
It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.
By Su Jianlin | 2025-08-21 | 163 Readers
Readers who have finished the first three articles are likely familiar with the “routine” of this series—first, proposing constraints on the update quantity, finding the steepest descent direction, and then adding constraints to the parameters to find a new steepest descent direction. When solving parameter-constrained problems, we adopt the “first-order approximation is sufficient” principle to simplify the constraint form, which geometrically corresponds to the “tangent space.” Then, we use the method of undetermined coefficients to convert to an unconstrained form to derive an analytical solution, and finally numerically solve for the undetermined coefficients.
In this article, we will solve another new example—Muon under spectral sphere constraints—which is an analogous generalization of the first article “Steepest Descent on Manifolds: 1. SGD + Hypersphere”. It can be considered when we want the spectral norm of the parameters to remain constant. Of course, it can also simply serve as an exercise to practice.
Problem Description#
In “Steepest Descent on Manifolds: 2. Muon + Orthogonality” and “Steepest Descent on Manifolds: 3. Muon + Stiefel”, we have already discussed in detail the collision of Muon with orthogonal constraints, so we will not elaborate on the relevant background and directly present the problem formulation:
$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \Vert\boldsymbol{W}\Vert_2 = 1,\,\,\Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert_2=1 $$where $\boldsymbol{W},\boldsymbol{\Phi}\in\mathbb{R}^{n\times m}(n \geq m)$, and $\Vert\cdot\Vert_2$ is the spectral norm. Of course, if we are interested, the latter two spectral norms can be replaced with other norms, such as the Frobenius norm, which represents the combination of “Muon + Hypersphere.”
The “first-order approximation is sufficient” principle requires us to find the gradient of the spectral norm. We have already introduced this in “Thoughts on Spectral Norm Gradient and Novel Weight Decay” and “Derivatives of SVD”. The answer is $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$, where $\boldsymbol{u}_1,\boldsymbol{v}_1$ are the two singular vectors corresponding to the largest singular value of $\boldsymbol{W}$, which can be solved by power iteration. This result also assumes that the largest singular value is unique; we will discuss the non-unique case later.
If it is the Frobenius norm, then $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_F=\boldsymbol{W}/\Vert\boldsymbol{W}\Vert_F$. In summary, regardless of the norm, there exists a matrix $\boldsymbol{\Theta}$ that depends only on $\boldsymbol{W}$, such that $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert=\boldsymbol{\Theta}$. Then, from $\Vert\boldsymbol{W}\Vert = 1$ and $\Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert=1$, its first-order approximation can be obtained as $0 = \langle\boldsymbol{\Theta},\boldsymbol{\Phi}\rangle_F = \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$. Therefore, under first-order approximation, the general formulation for this type of problem is:
$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 $$Undetermined Coefficients#
The routine is still the same: introducing the undetermined coefficient $\lambda$, we have
$$ \begin{aligned} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \tr((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}) \\ \leq &\, \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* \end{aligned} $$The last inequality is a result of Muon itself, with equality holding when
$$ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta}) $$The next task is to solve for a $\lambda$ such that it satisfies the constraint condition $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$, and then the problem is solved.
Due to the presence of $\msign$, $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is actually a nonlinear equation. The author tends to believe it has no analytical solution, thus seeking numerical methods. However, with the experience from “Steepest Descent on Manifolds: 3. Muon + Stiefel”, we can calmly construct an iterative method for such equations.
Iterative Solution#
First, according to the definition $\msign(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}$, we can write $\boldsymbol{\Phi}=(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}$, where $\boldsymbol{Q}=((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}))^{1/2}$, then
$$ \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0\qquad\Rightarrow\qquad \lambda = -\frac{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G}\boldsymbol{Q}^{-1})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}\boldsymbol{Q}^{-1})} $$But it should be noted that this is not an analytical solution, because $\boldsymbol{Q}$ also depends on $\lambda$. However, based on the above equation, we can construct an iterative scheme: substitute an initial $\lambda$ to find $\boldsymbol{Q}= (\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}$, then substitute it into the above equation to update $\lambda$, repeating until convergence.
However, although this iterative scheme is theoretically feasible, it requires calculating $\boldsymbol{Q}^{-1}$. Although we have mentioned efficient algorithms for solving it in “Efficient Computation of Matrix r-th Roots and Inverse r-th Roots”, from the perspective of “Occam’s razor” (literally “do not multiply entities beyond necessity”), we still hope to avoid iterations other than those involving $\msign$. Therefore, the author attempts to find another iterative scheme that does not require calculating $\boldsymbol{Q}^{-1}$. To this end, we first write
$$ \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} $$For our goal, the trace of the above expression is zero. We can explicitly subtract $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m$ from the left side of the above expression to ensure it satisfies this condition
$$ \boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1} $$Now, multiplying both sides by $\boldsymbol{Q}$ and then taking the trace allows us to solve for $\lambda$
$$ \lambda = \frac{\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}\boldsymbol{Q}) - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Phi}) \tr(\boldsymbol{Q})/m - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})} $$This way, iterations do not require calculating $\boldsymbol{Q}^{-1}$.
Reference Code#
We use $\lambda=-\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})/\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})$ as the initial value, and the test code is as follows:
import numpy as np
def msign(g):
"""奇异值分解精确计算msign
"""
u, s, vh = np.linalg.svd(g, full_matrices=False)
return u @ np.diag(np.sign(s)) @ vh
def dot(a, b):
"""恒等于 np.trace(a.T @ b)
"""
return (a * b).sum()
n, m = 100, 50
w = np.random.randn(n, m) / m**0.5
g = np.random.randn(n, m) / m**0.5
u, s, vh = np.linalg.svd(w, full_matrices=False)
theta = u[:, :1] @ vh[:1]
lamb = - dot(theta, g) / dot(theta, theta)
for i in range(10):
phi = msign(z := g + lamb * theta)
print('step:', i, ', inner product:', dot(phi, g), ', tangent error:', dot(theta, phi))
q, x = z.T @ phi, theta.T @ phi
lamb = (dot(x, q) - np.trace(x) * np.trace(q) / m - dot(theta, g)) / dot(theta, theta)
Other Details#
Similar to the previous three articles, because the “first-order approximation is sufficient” principle is used, the spectral norm of $\boldsymbol{W} - \eta\boldsymbol{\Phi}$ is accurate to $1 + \mathcal{O}(\eta^2)$, and usually cannot be exactly 1. Therefore, we also need to perform a spectral normalization:
$$ \boldsymbol{W}\quad\leftarrow\quad \frac{\boldsymbol{W} - \eta\boldsymbol{\Phi}}{\Vert\boldsymbol{W} - \eta\boldsymbol{\Phi}\Vert_2} $$Fortunately, the spectral norm can be efficiently calculated via power iteration, so this is not a particularly expensive computation (compared to the iteration of $\msign$ itself).
Another scenario worth analyzing is when the largest singular value is not unique. This special case can be ignored in practical numerical computations, but for theoretical completeness, it should be included in the analysis. In this case, the corresponding singular vectors are also not unique, which is equivalent to having multiple distinct tangent spaces, and the actual feasible space is the intersection of these tangent spaces. Taking the example of two largest singular values, the problem becomes
$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\,\, \tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0 $$Here $\boldsymbol{\Theta}_1=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, \boldsymbol{\Theta}_2=\boldsymbol{u}_2 \boldsymbol{v}_2^{\top}$. Introducing two undetermined coefficients $\lambda_1,\lambda_2$, we can solve for
$$ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2) $$The next step is to solve the system of equations $\tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0$. Similarly, introduce
$$ \boldsymbol{Q}=((\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2))^{1/2} = (\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}\boldsymbol{\Phi} $$The system of equations can be written as
$$ \begin{gathered} \boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_1^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_2^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \end{gathered} $$Multiplying both sides by $\boldsymbol{Q}$ and then taking the trace transforms it into a system of two linear equations in $\lambda_1,\lambda_2$ that can be solved. Based on this, an iterative solution scheme can be constructed. The details will not be elaborated here; interested readers can complete them on their own for practice.
Summary (formatted)#
This article mainly considered the Muon form corresponding to parameters constrained by a spectral norm or a general norm. Building upon the first three articles, this piece presents no obvious technical difficulties, and readers can simply regard it as an supplementary exercise to practice.
@online{kexuefm-11241,
title={Steepest Descent on Manifolds: 4. Muon + Spectral Sphere},
author={苏剑林},
year={2025},
month={08},
url={\url{https://kexue.fm/archives/11241}},
}