
2. Muon + Orthogonal


This is a gemini-2.5-flash translation of a Chinese article.

It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.

By Su Jianlin | 2025-08-06

This article continues our series on constrained optimization. In the previous article, “Steepest Descent on Manifolds: 1. SGD + Hypersphere”, we revisited the “principle of least action” for optimizers, proposing that the core difference among various optimizers lies in the different constraints imposed on the update amount. If this constraint is the Euclidean norm, then the corresponding steepest descent is SGD. Furthermore, we discussed the result of simultaneously adding a norm constraint to the parameters, which constitutes the steepest descent on a hyperspherical manifold.

However, the previous article was merely a warm-up, as it dealt with the relatively simple case of vector parameters. This article formally enters the more challenging part: the parameters change from vectors to matrices, and the constraint on the update is changed to the spectral norm, which leads to the Muon optimizer. We then add an orthogonal constraint to the parameters, which yields the Muon optimizer on the orthogonal manifold.

Problem Description

Let the parameters to be optimized be in matrix form $\boldsymbol{W}\in\mathbb{R}^{n\times m}$. Without loss of generality, assume $n\geq m$. According to the “principle of least action” from the previous article, the increment $\Delta\boldsymbol{W}$ for steepest descent should satisfy

$$ \min_{\Delta \boldsymbol{W}} \mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W}) \qquad \text{s.t.}\qquad \rho(\Delta\boldsymbol{W})\leq \eta $$

If $\rho$ is the Frobenius norm, the result is the same as in the previous article, because the F-norm is simply the L2 norm of the matrix flattened into a vector, so the steepest descent is again SGD with the matrix treated as a vector. To obtain a result that better reflects the essence of matrices, the norm we choose here is the Spectral Norm, also known as the matrix “$2$-norm”, denoted $\Vert\cdot\Vert_2$.

As for why the spectral norm is chosen, please refer to “Appreciation of Muon Optimizer: An Essential Leap from Vectors to Matrices”, “Muon Sequel: Why Do We Choose to Try Muon?”, and “Higher-Order muP: More Concise Yet Smarter Spectral Condition Scaling”; we will not repeat the introduction here. Simply put, the spectral norm is the most compact norm for revealing the changes in linear layers, making it more suitable as a measure of a matrix’s “stability”.

Following the previous steps, a first-order approximation of $\mathcal{L}(\boldsymbol{W} +\Delta\boldsymbol{W})$ yields $\mathcal{L}(\boldsymbol{W}) + \langle \boldsymbol{G}, \Delta\boldsymbol{W}\rangle_F$, where $\boldsymbol{G}=\nabla_{\boldsymbol{W}}\mathcal{L}(\boldsymbol{W})$. Here, $\langle\cdot,\cdot\rangle_F$ is the inner product calculated after flattening the two matrices into vectors, which is also equal to $\tr(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})$. Then, let $\Delta\boldsymbol{W} = -\eta \boldsymbol{\Phi}$, and the original problem can be simplified to

$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1 \tag{1} $$

The transformation steps so far are general. If you’ve forgotten the details, please refer back to the previous article.

Basic Result

The solution process for the objective (1) was already given in the “Matrix Norms” section of “Appreciation of Muon Optimizer: An Essential Leap from Vectors to Matrices”, but for the sake of completeness, we will reiterate it here. Let the SVD of $\boldsymbol{G}$ be $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} = \sum\limits_{i=1}^r \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top}$, where $r$ is the rank of $\boldsymbol{G}$. We have

$$ \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi})=\tr\left(\sum_{i=1}^r \sigma_i \boldsymbol{v}_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\right) = \sum_{i=1}^r \sigma_i \boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i $$

By definition, when $\Vert\boldsymbol{\Phi}\Vert_2=1$, $\Vert\boldsymbol{\Phi}\boldsymbol{v}_i\Vert_2\leq \Vert\boldsymbol{v}_i\Vert_2=1$, so $\boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i\leq 1$. Therefore,

$$ \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi})\leq \sum_{i=1}^r \sigma_i = \Vert \boldsymbol{G}\Vert_* $$

Here, $\Vert\cdot\Vert_*$ is called the Nuclear Norm of the matrix. Equality holds when all $\boldsymbol{u}_i^{\top}\boldsymbol{\Phi}\boldsymbol{v}_i$ are equal to 1, at which point

$$ \boldsymbol{\Phi} = \sum_{i=1}^r \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top} = \msign(\boldsymbol{G}) $$

Note that if $r < m$, adding $\boldsymbol{u}_{r+1} \boldsymbol{v}_{r+1}^{\top},\boldsymbol{u}_{r+2} \boldsymbol{v}_{r+2}^{\top},\cdots$ also preserves the equality, so the solution is not unique. However, the terms beyond $r$ cannot be uniquely determined, so the above equation can be regarded as the deterministic, minimal solution. Interested readers can also try the “sledgehammer” of Von Neumann’s trace inequality to find the general solution under the Schatten-$p$ norm, with the spectral norm corresponding to the limit $p\to\infty$.
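For concreteness, here is a small NumPy sketch of this result. The exact-SVD `msign` below is an illustrative stand-in (practical Muon implementations approximate it iteratively, e.g. via Newton–Schulz); we check that $\boldsymbol{\Phi}=\msign(\boldsymbol{G})$ attains the nuclear-norm bound while having unit spectral norm:

```python
import numpy as np

def msign(G, eps=1e-9):
    """msign(G) = U_[:, :r] V_[:, :r]^T via an economy SVD,
    keeping the r singular directions with sigma_i > eps."""
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    r = int((s > eps).sum())
    return U[:, :r] @ Vh[:r, :]

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))     # n >= m, full rank almost surely
Phi = msign(G)
# tr(G^T Phi) should equal the nuclear norm (sum of singular values)
nuclear = np.linalg.svd(G, compute_uv=False).sum()
```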

Orthogonal Manifold

Thus far, we have proved that for matrix parameters, the fastest descent direction under a spectral norm constraint is not simply the negative gradient $-\boldsymbol{G}$, but involves an additional $\msign$ operator, i.e., $-\msign(\boldsymbol{G})$. This is precisely the Muon optimizer used to train Kimi K2, which is one of the most competitive optimizers currently available. This, in turn, demonstrates that the spectral norm is a very appropriate stability constraint for matrices.
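To make the connection concrete, a minimal sketch of such an update step (the hyperparameters `lr` and `beta` are illustrative choices, the momentum form is the usual practical variant, and the exact SVD stands in for the iterative orthogonalization used in real implementations):

```python
import numpy as np

def msign(G, eps=1e-9):
    # msign via economy SVD, truncated to the numerical rank
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    r = int((s > eps).sum())
    return U[:, :r] @ Vh[:r, :]

def muon_step(W, G, M, lr=0.02, beta=0.95):
    """One Muon-style step: accumulate momentum M, then apply the
    orthogonalized update -lr * msign(M)."""
    M = beta * M + G
    return W - lr * msign(M), M

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
G = rng.standard_normal((8, 4))
W_new, M = muon_step(W, G, np.zeros_like(W))
```

Since $\msign$ has unit spectral norm (for a nonzero gradient), the spectral norm of each update is exactly the learning rate, which is the “principle of least action” at work.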

Of course, the results so far are nothing new. Now we start to tinker with something new: adding an orthogonal constraint $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}$ to the parameter $\boldsymbol{W}$ (Source: “Orthogonal manifold”). This falls into two cases: first, $n=m$, where $\boldsymbol{W}$ is a proper orthogonal matrix satisfying $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I}$; second, $n > m$, where $\boldsymbol{W}\boldsymbol{W}^{\top}=\boldsymbol{I}$ cannot hold, in which case $\boldsymbol{W}$ is usually called a semi-orthogonal matrix, and the corresponding space is known as the Stiefel manifold.

Specifically, the problem we are now addressing is:

$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,(\boldsymbol{W} - \eta \boldsymbol{\Phi})^{\top}(\boldsymbol{W} - \eta \boldsymbol{\Phi})=\boldsymbol{I} $$

Still adhering to the principle that “first-order approximation is sufficient”, the last condition can be simplified to $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}$, meaning $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ is an antisymmetric matrix:

$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0} \tag{2} $$

When would an orthogonal constraint be useful? There are, in fact, many scenarios. For example, in classification problems, if the categories are known to be largely uncorrelated, an orthogonality constraint can be considered for the classification weight matrix. In practice, though, we often settle for approximate orthogonality by adding a regularization term $\Vert\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I}\Vert_F^2$ to the loss. Another example is LoRA, where the parameterization $\boldsymbol{A}\boldsymbol{B}$ contains redundancy that can be reduced through orthogonal constraints, and so on.
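As a quick illustration of the regularization route: the gradient of $\lambda\Vert\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I}\Vert_F^2$ is $4\lambda\,\boldsymbol{W}(\boldsymbol{W}^{\top}\boldsymbol{W}-\boldsymbol{I})$, which vanishes exactly at (semi-)orthogonal $\boldsymbol{W}$. A NumPy sketch (the function name and weight `lam` are made up for this example):

```python
import numpy as np

def ortho_penalty_grad(W, lam=1e-3):
    """Gradient of lam * ||W^T W - I||_F^2, which works out to
    4 * lam * W @ (W^T W - I)."""
    m = W.shape[1]
    return 4.0 * lam * W @ (W.T @ W - np.eye(m))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # semi-orthogonal: Q^T Q = I
grad_at_Q = ortho_penalty_grad(Q)                 # zero at exact orthogonality
```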

Solution Process

To solve the objective (2), similar to the previous article, we introduce a matrix of undetermined coefficients $\boldsymbol{\Lambda}\in\mathbb{R}^{m\times m}$, yielding

$$ \begin{aligned} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \tr(\boldsymbol{\Lambda}^{\top}(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})) \\ =&\, \tr((\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top}))^{\top}\boldsymbol{\Phi}) \\ \leq &\, \Vert\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})\Vert_* \end{aligned} $$

The trace identity used in the second equality is

$$ \tr(\boldsymbol{A}\boldsymbol{B}) = \tr(\boldsymbol{B}\boldsymbol{A}) = \tr(\boldsymbol{A}^{\top}\boldsymbol{B}^{\top}) = \tr(\boldsymbol{B}^{\top}\boldsymbol{A}^{\top}) $$

According to the Muon result from the previous section, the condition for equality is

$$ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}(\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top})) $$

The remaining problem is to find a real symmetric matrix $\boldsymbol{X} = \boldsymbol{\Lambda} + \boldsymbol{\Lambda}^{\top}$ such that $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ is an antisymmetric matrix. This is easy to solve for $n=m$, because in this case $\boldsymbol{W}^{\top}$ can be absorbed into $\msign$:

$$ \boldsymbol{W}^{\top}\boldsymbol{\Phi} = \boldsymbol{W}^{\top}\msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}) = \msign(\boldsymbol{W}^{\top}(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})) = \msign(\boldsymbol{W}^{\top}\boldsymbol{G} +\boldsymbol{X}) $$

Note that $\msign$ also has the property of preserving antisymmetry; that is, if a square matrix $\boldsymbol{M}$ is antisymmetric, then $\msign(\boldsymbol{M})$ is also antisymmetric (please prove it). Therefore, to make $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ antisymmetric, we just need to make $\boldsymbol{W}^{\top}\boldsymbol{G} +\boldsymbol{X}$ antisymmetric. Note that $\boldsymbol{X}$ is symmetric, so this is equivalent to decomposing $\boldsymbol{W}^{\top}\boldsymbol{G}$ into the sum of a symmetric matrix and an antisymmetric matrix, which has a readily available solution:

$$ \boldsymbol{W}^{\top}\boldsymbol{G} = \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} + \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}} + \underbrace{\frac{1}{2}(\boldsymbol{W}^{\top}\boldsymbol{G} - \boldsymbol{G}^{\top}\boldsymbol{W})}_{[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}} $$

where $[\boldsymbol{M}]_{\text{sym}} = (\boldsymbol{M}+\boldsymbol{M}^{\top})/2, [\boldsymbol{M}]_{\text{skew}} = (\boldsymbol{M}-\boldsymbol{M}^{\top})/2$. Based on the above identity, we can directly derive $\boldsymbol{X} = -[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}$. As for the solution when $n > m$, it will be more complex, and we will discuss it in detail in the next article. In this article, we will first try to fully resolve the case where $n=m$.
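Both facts used above, that $\msign$ preserves antisymmetry and that $\boldsymbol{X}=-[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}$ makes $\boldsymbol{W}^{\top}\boldsymbol{\Phi}$ antisymmetric, are easy to check numerically. A NumPy sketch (exact-SVD `msign` as an illustrative stand-in):

```python
import numpy as np

def msign(G, eps=1e-9):
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    r = int((s > eps).sum())
    return U[:, :r] @ Vh[:r, :]

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
M = A - A.T                             # antisymmetric (rank < n since n is odd)
S = msign(M)                            # should again be antisymmetric

W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal W (n = m case)
G = rng.standard_normal((n, n))
WtG = W.T @ G
X = -(WtG + WtG.T) / 2                  # X = -[W^T G]_sym
Phi = msign(G + W @ X)                  # the constrained steepest-descent direction
```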

Retraction Operation

In summary, when $n=m$, the final result we obtained is

$$ \begin{aligned} \boldsymbol{\Phi} =&\, \msign(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\boldsymbol{W}^{\top}\msign(\boldsymbol{G} - \boldsymbol{W}[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\msign(\boldsymbol{W}^{\top}\boldsymbol{G} - [\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{sym}}) \\ =&\, \boldsymbol{W}\msign([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}) \\ \end{aligned} $$
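As a sanity check, the first and last lines of this chain of equalities can be compared numerically (NumPy sketch; `msign` via exact SVD):

```python
import numpy as np

def msign(G, eps=1e-9):
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    r = int((s > eps).sum())
    return U[:, :r] @ Vh[:r, :]

rng = np.random.default_rng(1)
n = 6
W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal W
G = rng.standard_normal((n, n))
WtG = W.T @ G
sym = (WtG + WtG.T) / 2                 # [W^T G]_sym
skew = (WtG - WtG.T) / 2                # [W^T G]_skew

Phi_first = msign(G - W @ sym)          # first line of the chain
Phi_last = W @ msign(skew)              # last line of the chain
```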

Thus, the new variable is

$$ \boldsymbol{W} - \eta \boldsymbol{\Phi} = \boldsymbol{W}(\boldsymbol{I} - \eta\,\underbrace{\msign([\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}})}_{\text{denoted as }\boldsymbol{O}}) \tag{3} $$

It is not exactly an orthogonal matrix, but it is orthogonal up to an $\mathcal{O}(\eta^2)$ error, consistent with our “first-order approximation is sufficient” principle. To see this, one only needs to verify

$$ \begin{aligned} (\boldsymbol{I} - \eta\boldsymbol{O})^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O}) =&\,(\boldsymbol{I} - \eta\boldsymbol{O})^{\top}(\boldsymbol{I} - \eta\boldsymbol{O}) \\ =&\, \boldsymbol{I} - \eta(\boldsymbol{O}^{\top} + \boldsymbol{O}) + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O} \\ =&\, \boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O} \\ \end{aligned} \tag{4} $$

If $[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}$ is full rank, then $\boldsymbol{O}$ is an orthogonal matrix, i.e., $\boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I}$, and simply dividing by $\sqrt{1+\eta^2}$ makes the expression (3) satisfy orthogonality. However, when $[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}$ is not full rank, no such simple rescaling works. In that situation, the general approach is to replace the update with its closest orthogonal matrix, which is precisely what $\msign$ computes! Therefore, the complete update rule is

$$ \boldsymbol{W} \quad \leftarrow\quad \msign(\boldsymbol{W} - \eta \boldsymbol{\Phi}) = \msign(\boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})) = \boldsymbol{W}\msign(\boldsymbol{I} - \eta\boldsymbol{O}) $$

However, this requires calculating $\msign$ twice, so we will try to simplify it. According to the definition and equation (4), we have

$$ \msign(\boldsymbol{I} - \eta\boldsymbol{O}) = (\boldsymbol{I} - \eta\boldsymbol{O})(\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O})^{-1/2} $$

Note that whether or not $\boldsymbol{O}$ is full rank, $(\boldsymbol{O}^{\top}\boldsymbol{O})^2 = \boldsymbol{O}^{\top}\boldsymbol{O}$. Let $(1+\eta^2 x)^{-1/2}=1 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots$, then

$$ \begin{aligned} (\boldsymbol{I} + \eta^2\boldsymbol{O}^{\top}\boldsymbol{O})^{-1/2} =&\, \boldsymbol{I} + a_1 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_2 (\boldsymbol{O}^{\top}\boldsymbol{O})^2 + a_3 (\boldsymbol{O}^{\top}\boldsymbol{O})^3 + \cdots \\ =&\, \boldsymbol{I} + a_1 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_2 (\boldsymbol{O}^{\top}\boldsymbol{O}) + a_3 (\boldsymbol{O}^{\top}\boldsymbol{O}) + \cdots \\ =&\, \boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \underbrace{(1 + a_1 + a_2 + a_3 + \cdots)}_{(1+\eta^2 x)^{-1/2}\text{ evaluated at }x=1}\boldsymbol{O}^{\top}\boldsymbol{O} \\ \end{aligned} $$

This eliminates one $\msign$ operation. The simplified complete result is

$$ \boldsymbol{W} \quad \leftarrow\quad \boldsymbol{W}(\boldsymbol{I} - \eta\boldsymbol{O})\left(\boldsymbol{I} - \boldsymbol{O}^{\top}\boldsymbol{O} + \frac{\boldsymbol{O}^{\top}\boldsymbol{O}}{\sqrt{1+\eta^2}}\right) $$
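The whole retraction can be verified numerically: the series closed form matches $\msign(\boldsymbol{I}-\eta\boldsymbol{O})$, and the updated $\boldsymbol{W}$ is exactly orthogonal even when $[\boldsymbol{W}^{\top}\boldsymbol{G}]_{\text{skew}}$ is rank-deficient (with $n$ odd it necessarily is). A NumPy sketch (exact-SVD `msign`, illustrative $\eta$):

```python
import numpy as np

def msign(G, eps=1e-9):
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    r = int((s > eps).sum())
    return U[:, :r] @ Vh[:r, :]

rng = np.random.default_rng(0)
n, eta = 5, 0.1                      # n odd => the skew part is rank-deficient
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = rng.standard_normal((n, n))
WtG = W.T @ G
O = msign((WtG - WtG.T) / 2)         # antisymmetric, so O^T O is a projector
P = O.T @ O
I = np.eye(n)

retract = I - P + P / np.sqrt(1 + eta**2)
W_new = W @ (I - eta * O) @ retract  # the complete update rule
```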

Summary

In this article, we revisited the conclusion that adding a spectral norm constraint to the update amount of matrix parameters yields the Muon optimizer, and then explored the form of the Muon optimizer after adding an orthogonal constraint to the parameters. If you wish for parameters to remain orthogonal matrices during updates, then this article may be of some reference value.

@online{kexuefm-11215,
        title={Steepest Descent on Manifolds: 2. Muon + Orthogonal},
        author={苏剑林},
        year={2025},
        month={08},
        url={\url{https://kexue.fm/archives/11215}},
}