In the previous four articles, we solved several specific steepest descent problems with equality constraints on parameters. The problems in the third and fourth articles could not be solved analytically, so the author proposed corresponding fixed-point iteration methods. Specifically, the “Muon on Stiefel manifold” studied in the third article, “Steepest Descent on Manifolds: 3. Muon + Stiefel”, was a problem originally proposed by Jeremy Bernstein in his article “Orthogonal manifold”.
For this problem, Jeremy Bernstein also eventually provided his own solution, which the author calls “Dual Gradient Descent,” and it is quite worthwhile to study.
Basic Concepts#
Jeremy Bernstein’s solution was eventually published in Thinking Machines Lab’s blog “Modular Manifolds”. It was the second blog post from that lab, and in the article, it was called “Dual Ascent.” However, here, in conjunction with the content of the previous four articles, the author will refer to it as “Dual Gradient Descent.”
In fact, Dual Gradient Descent can be said to be a natural consequence of the Lagrange multiplier method. However, a rigorous discussion of the Lagrange multiplier method is quite cumbersome; for example, it requires introducing the Minimax theorem. So, in this series, to avoid these complications, we adopted a “method of undetermined coefficients” approach for derivation, which makes Dual Gradient Descent seem less natural. But no matter, we can still derive it following our line of thought; it just might take a bit more space.
First, let’s review the various notations. $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ is a matrix parameter. Without loss of generality, assume $n\geq m$. $\boldsymbol{G}\in\mathbb{R}^{n\times m}$ is its gradient. $\Vert\boldsymbol{G}\Vert_2$ is the spectral norm of matrix $\boldsymbol{G}$, which equals its largest singular value; $\Vert\boldsymbol{G}\Vert_*$ is the nuclear norm of matrix $\boldsymbol{G}$, which equals the sum of all its singular values. In particular, according to the conclusion of the article “Derivatives of SVD”, we have
$$ \nabla_{\boldsymbol{G}}\Vert\boldsymbol{G}\Vert_* = \sum_i \nabla_{\boldsymbol{G}} \sigma_i = \sum_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}\boldsymbol{V}^{\top} = \msign(\boldsymbol{G}) $$where $\boldsymbol{G}=\sum_i \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$ is the SVD of $\boldsymbol{G}$. In other words, the gradient of the nuclear norm is exactly the $\msign$ operator, which is an important foundation for the subsequent derivation.
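As a quick numerical sanity check of this identity (an illustrative sketch, not code from the original article), the following NumPy snippet compares a finite-difference gradient of the nuclear norm against $\msign(\boldsymbol{G})$ computed from a full SVD; the helper names `msign` and `nuclear_norm` are assumptions of this sketch.

```python
import numpy as np

def msign(G):
    # msign(G) = U V^T from the reduced SVD G = U diag(s) V^T
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nuclear_norm(G):
    # sum of the singular values
    return np.linalg.svd(G, compute_uv=False).sum()

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

# Central finite differences of ||G||_*, entry by entry.
eps = 1e-6
fd = np.zeros_like(G)
for i in range(G.shape[0]):
    for j in range(G.shape[1]):
        E = np.zeros_like(G)
        E[i, j] = eps
        fd[i, j] = (nuclear_norm(G + E) - nuclear_norm(G - E)) / (2 * eps)

print(np.abs(fd - msign(G)).max())  # should be tiny for a generic full-rank G
```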
Problem Description#
We will continue to introduce Dual Gradient Descent following our previous derivation approach, so this section will first restate the problems and existing results.
In “Steepest Descent on Manifolds: 3. Muon + Stiefel”, the problem we aimed to solve was
$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0} $$The solution is $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})$, where $\boldsymbol{X}\in\mathbb{R}^{m\times m}$ is an undetermined symmetric matrix such that $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}$.
In “Steepest Descent on Manifolds: 4. Muon + Spectral Sphere”, the problem we sought to solve was
$$ \max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\, \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 $$The answer is $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta})$, where $\lambda$ is an undetermined coefficient such that $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$.
As can be seen, our task ultimately reduces to finding undetermined coefficients that satisfy the additionally introduced equality constraints, which is essentially solving a system of nonlinear equations. Dual Gradient Descent instead transforms this equation-solving problem into minimizing a certain objective function, which can then be attacked with gradient descent.
Dual Objective#
The key to the transformation is the gradient identity of the nuclear norm. For simplicity, let’s first look at the “Muon + Spectral Sphere” problem, where the undetermined coefficient is just a scalar, making it easier to observe. It is not difficult to verify
$$ \nabla_{\lambda} \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* = \tr(\boldsymbol{\Theta}^{\top}\msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta})) = \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) $$This means that solving the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is equivalent to finding a point where the gradient of $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ equals 0, which could be its (local) minimum/maximum point. Since $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ obviously has no maximum value, we convert it to finding its minimum point:
$$ \lambda^* = \mathop{\text{argmin}}_{\lambda} \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* $$Let’s review the steps here again:
- Our goal is to solve the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$; finding any one solution is sufficient.
- $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$ is exactly the gradient of $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ with respect to $\lambda$.
- This turns the task into finding a (local) minimum/maximum point, because the gradient at such points is (generically) zero.
- It is easy to see that the objective has no maximum (it grows without bound as $|\lambda|\to\infty$ whenever $\boldsymbol{\Theta}\neq\boldsymbol{0}$), so we look for its minimum instead.
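Continuing the NumPy sketch above (again an illustrative setup, not code from the article), the key identity $\nabla_{\lambda}\Vert\boldsymbol{G}+\lambda\boldsymbol{\Theta}\Vert_* = \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Phi})$ can be checked directly with a finite difference in $\lambda$:

```python
import numpy as np

def msign(G):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nuclear_norm(G):
    return np.linalg.svd(G, compute_uv=False).sum()

rng = np.random.default_rng(1)
G, Theta = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
lam, eps = 0.3, 1e-6

# Analytic derivative: tr(Theta^T Phi) with Phi = msign(G + lam * Theta)
analytic = np.trace(Theta.T @ msign(G + lam * Theta))
# Central finite difference of the dual objective in lam
numeric = (nuclear_norm(G + (lam + eps) * Theta)
           - nuclear_norm(G + (lam - eps) * Theta)) / (2 * eps)
print(analytic, numeric)  # the two should agree to several decimal places
```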
Gradient Descent#
Having determined the objective, we can solve it with gradient descent; the gradient is already available, namely $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, so the gradient descent update is
$$ \lambda \quad \leftarrow\quad \lambda - \eta \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) $$Of course, we could also apply $\mathop{\text{sign}}$ to $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, i.e., use SignSGD; such variants can be freely explored. In terms of the iteration format, Dual Gradient Descent is much simpler than the fixed-point iteration we proposed earlier. However, in many cases it requires many more iteration steps and may need a carefully tuned learning rate, momentum mechanisms, etc., to converge.
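A minimal standalone sketch of this iteration, under the same assumed NumPy setup as above (the step size and iteration count are illustrative and, as just noted, may need tuning to converge):

```python
import numpy as np

def msign(G):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
G, Theta = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))

lam, eta = 0.0, 0.05
for _ in range(1000):
    Phi = msign(G + lam * Theta)
    lam -= eta * np.trace(Theta.T @ Phi)   # lambda <- lambda - eta * tr(Theta^T Phi)

# Residual of the constraint tr(Theta^T Phi) = 0; small if the run converged
print(np.trace(Theta.T @ msign(G + lam * Theta)))
```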
Therefore, as far as solving the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is concerned, Dual Gradient Descent is not a particularly ideal solution. However, our ultimate goal is not to solve the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$, but to calculate $\boldsymbol{\Phi}$ as the optimization direction for the model. Model optimization is inherently an iterative process; we can cache the historical $\lambda$ values and then adopt an approximate strategy where $\lambda$ is updated synchronously with the model parameters:
$$ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta}), \quad \boldsymbol{W}\leftarrow\boldsymbol{W}- \eta_1 \boldsymbol{\Phi},\quad \lambda \leftarrow\lambda - \eta_2 \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) $$This way, each training step only requires one additional, almost free calculation of $\lambda - \eta_2 \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, and an approximate implementation of the original objective is obtained. Formally, it is equivalent to an adaptive Weight Decay for Muon.
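As a sketch of how this might look inside a training step (assumed setup: `grad_fn` and `theta_fn` are hypothetical placeholders for whatever produces $\boldsymbol{G}$ and $\boldsymbol{\Theta}$ at the current parameters; the learning rates are illustrative):

```python
import numpy as np

def msign(G):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def dual_muon_step(W, lam, grad_fn, theta_fn, eta1=1e-2, eta2=1e-1):
    G, Theta = grad_fn(W), theta_fn(W)
    Phi = msign(G + lam * Theta)          # update direction
    dual_grad = np.trace(Theta.T @ Phi)   # almost free: a single trace
    return W - eta1 * Phi, lam - eta2 * dual_grad
```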
Stiefel Manifold#
After discussing the relatively simpler “Muon + Spectral Sphere,” let’s now look at the “Muon + Stiefel” objective. Here the undetermined matrix $\boldsymbol{X}$ carries the constraint $\boldsymbol{X}=\boldsymbol{X}^{\top}$. We remove this constraint by writing $\boldsymbol{X}=\boldsymbol{\Lambda}+\boldsymbol{\Lambda}^{\top}$, where $\boldsymbol{\Lambda}\in\mathbb{R}^{m\times m}$ is an arbitrary matrix. It then follows that
$$ \nabla_{\boldsymbol{\Lambda}}\Vert\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}\Vert_* = \boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} $$Here $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})$. Therefore, solving the system of equations $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}=\boldsymbol{0}$ can also be transformed into finding the minimum point of the function $\Vert\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}\Vert_*$, and then solving it with gradient descent:
$$ \boldsymbol{\Lambda} \quad\leftarrow\quad \boldsymbol{\Lambda} - \eta(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}) $$Since $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}$ is necessarily symmetric, directly using $\boldsymbol{X} \leftarrow\boldsymbol{X} - \eta(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})$ is also acceptable. By iterating it synchronously with $\boldsymbol{W}$, we get:
$$ \boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}), \quad \boldsymbol{W}\leftarrow\boldsymbol{W}- \eta_1 \boldsymbol{\Phi},\quad \boldsymbol{X} \leftarrow\boldsymbol{X} - \eta_2(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}) $$This achieves an approximate solution of the original objective, and the additional update $\boldsymbol{X} \leftarrow\boldsymbol{X} - \eta_2(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})$ in each step is almost free.
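The corresponding sketch for the Stiefel case, under the same assumed setup (`grad_fn` is again a hypothetical placeholder, and $\boldsymbol{X}$ is kept symmetric throughout; the multiplier gradient is computed at the $\boldsymbol{W}$ used to form $\boldsymbol{\Phi}$):

```python
import numpy as np

def msign(G):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def dual_stiefel_step(W, X, grad_fn, eta1=1e-2, eta2=1e-1):
    G = grad_fn(W)
    Phi = msign(G + W @ X)              # candidate direction
    sym = W.T @ Phi + Phi.T @ W         # gradient signal for X (symmetric by construction)
    return W - eta1 * Phi, X - eta2 * sym
```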
Lagrange Multipliers#
In these two examples, the equations they needed to solve both happened to be equal to the gradient of a certain nuclear norm objective. Is this merely a coincidence? Of course not. As we mentioned in the “Basic Concepts” section, this is a natural result of the Lagrange multiplier method. In this section, we will elaborate on this.
For ease of understanding, let’s again take the relatively simple “Muon + Spectral Sphere” objective as an example. It can be equivalently written as
$$ \max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1} \min_{\lambda\in\mathbb{R}}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) $$To see why this is equivalent, note that the maximizing $\boldsymbol{\Phi}$ must satisfy $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$: otherwise the inner $\min$ over $\lambda$ can be driven to negative infinity, and the outer $\max$ could then only be negative infinity. As for relaxing $\Vert\boldsymbol{\Phi}\Vert_2 = 1$ to $\Vert\boldsymbol{\Phi}\Vert_2\leq 1$, this does not change the maximum (for a linear objective the maximum is attained on the boundary), but it makes the feasible region of $\boldsymbol{\Phi}$ a convex set.
With this equivalent form, we can use the Minimax theorem to swap the positions of $\min$ and $\max$:
$$ \begin{aligned} &\,\max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1} \min_{\lambda\in\mathbb{R}}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \min_{\lambda\in\mathbb{R}}\max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \min_{\lambda\in\mathbb{R}} \Vert\boldsymbol{G} + \lambda \boldsymbol{\Theta}\Vert_* \end{aligned} $$The step of taking the maximum over $\Vert\boldsymbol{\Phi}\Vert_2\leq 1$ is a basic result derived for Muon, so there is no difficulty in first finding the maximum. This way, we obtain the dual objective $\Vert\boldsymbol{G} + \lambda \boldsymbol{\Theta}\Vert_*$ for the original problem.
Some readers might wonder, “How is your Lagrange multiplier method different from what I learned?” This is because the Lagrange multiplier method here is generalized to general convex sets and strictly discusses the interchangeability of $\min$ and $\max$ to ensure that the final result is what we desire. In contrast, the Lagrange multiplier method we generally learn is just a set of heuristic solution procedures for constrained optimization problems in $\mathbb{R}^n$ and does not delve much into the details of theoretical guarantees.
Summary#
In this article, we introduced the idea of finding the steepest descent direction on manifolds using Dual Gradient Descent. It is also the method used by Thinking Machines Lab’s blog “Modular Manifolds” a while ago to solve the Muon on Stiefel manifold problem.
@online{kexuefm-11388,
title={Steepest Descent on Manifolds: 5. Dual Gradient Descent},
author={苏剑林},
year={2025},
month={11},
url={\url{https://kexue.fm/archives/11388}},
}