By Su Jianlin | 2025-09-15
In the previous two articles, 《Rethinking Learning Rate and Batch Size (I): Current Status》 and 《Rethinking Learning Rate and Batch Size (II): Mean Field》, we proposed the mean-field method to simplify calculations relating the learning rate to the batch size. The optimizers analyzed there were SGD, SignSGD, and SoftSignSGD, and since the main goal was simplification, essentially no new conclusions were obtained.
However, in today’s feast of optimizers, how can Muon be absent? Therefore, in this article, we will try to calculate the relevant conclusions for Muon and see if its relationship between learning rate and batch size presents new patterns.
Basic Notations#
As is well known, the main characteristic of Muon is its non-element-wise update rule. Therefore, the element-wise calculation methods previously used in 《How Should the Learning Rate Change as Batch Size Increases?》 and 《How Does Adam’s epsilon Affect the Learning Rate Scaling Law?》 will be completely unusable. Fortunately, the mean field introduced in the previous article is still effective, requiring only minor adjustments.
Let’s first introduce some notation. Let the loss function be $\mathcal{L}(\boldsymbol{W})$, where $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ is a matrix parameter (assume $n\geq m$), and let $\boldsymbol{G}$ be its gradient. The gradient for a single sample is denoted $\tilde{\boldsymbol{G}}$; its mean is $\boldsymbol{G}$ and its variance is $\sigma^2$. When the batch size is $B$, the gradient is denoted $\tilde{\boldsymbol{G}}_B$; its mean is still $\boldsymbol{G}$, but its variance becomes $\sigma^2/B$. Note that the variance here is just a scalar $\sigma^2$, unlike the previous articles, where we considered the full covariance matrix.
The core reason for this simplification is that the random variable itself is already a matrix, so its corresponding covariance matrix is actually a 4th-order tensor, which is cumbersome to discuss. Would simplifying it to a single scalar severely lose accuracy? Actually, no. Although we considered the full covariance matrix $\boldsymbol{\Sigma}$ in the previous two articles, a careful observation reveals that the final result only depends on $\tr(\boldsymbol{\Sigma})$, which is equivalent to simplifying it to a scalar from the beginning.
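Purely as a sanity check on this notation (my own illustrative sketch, not from the original article), the following NumPy snippet simulates single-sample gradients with entrywise noise variance $\sigma^2$ and verifies that averaging $B$ of them gives a batch gradient with entrywise variance $\sigma^2/B$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma, B = 32, 8, 1.5, 64
G = rng.normal(size=(n, m))          # population gradient (illustrative values)

# Each single-sample gradient is G plus entrywise noise of variance sigma^2;
# the batch gradient averages B of them, so its entrywise variance is sigma^2 / B.
batch_grads = np.array([
    (G + rng.normal(scale=sigma, size=(B, n, m))).mean(axis=0)
    for _ in range(5000)
])
print(batch_grads.var(axis=0).mean(), sigma**2 / B)   # the two numbers should roughly agree
```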
Hessian Matrix#
Similarly, let the update quantity be $-\eta\tilde{\boldsymbol{\Phi}}_B$, and consider the second-order expansion of the loss function:
$$ \mathcal{L}(\boldsymbol{W} - \eta\tilde{\boldsymbol{\Phi}}_B) \approx \mathcal{L}(\boldsymbol{W}) - \eta \tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{G}) + \frac{1}{2}\eta^2\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B) $$The first two terms should be straightforward, while the third term is more difficult to understand. Similar to the covariance matrix, the Hessian matrix $\boldsymbol{H}$ here is a 4th-order tensor, which is cumbersome to interpret.
The simplest way to approach this is from a linear-operator perspective, i.e., understanding $\boldsymbol{H}$ as a linear operator whose input and output are both matrices. We don’t need to know what $\boldsymbol{H}$ looks like, nor how $\boldsymbol{H}$ acts on $\tilde{\boldsymbol{\Phi}}_B$ concretely; we only need to know that $\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B$ is linear in $\tilde{\boldsymbol{\Phi}}_B$. This way, the objects we deal with are still matrices, with no extra cognitive load. Any suitable linear operator can then serve as an approximation of the Hessian, without having to write out its explicit higher-order tensor form.
The protagonist of this article is Muon. We take $\tilde{\boldsymbol{\Phi}}_B=\msign(\tilde{\boldsymbol{G}}_B)$ as its approximation for calculation. By definition, we write $\msign(\tilde{\boldsymbol{G}}_B)=\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}$. From a Newton’s method perspective, this is equivalent to assuming $\boldsymbol{H}^{-1}\boldsymbol{X} = \eta_{\max}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$, which implies $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$. This will be used in subsequent calculations.
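As a concrete reference for the definition used above, here is a minimal NumPy sketch (not from the original article) that computes $\msign$ via the SVD and checks that it coincides with $\boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$ for a full-column-rank matrix with $n\ge m$:

```python
import numpy as np

def msign(G):
    """Matrix sign via thin SVD: if G = U diag(s) V^T, return U V^T."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
n, m = 8, 5                       # assume n >= m and full column rank
G = rng.normal(size=(n, m))

# Closed form used in the text: G (G^T G)^{-1/2}
GtG = G.T @ G
w, V = np.linalg.eigh(GtG)        # G^T G is symmetric positive definite here
inv_sqrt = V @ np.diag(w**-0.5) @ V.T
assert np.allclose(msign(G), G @ inv_sqrt)
```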
Calculating Expectation#
Taking the expectation of both sides of the equation, we get:
$$ \mathbb{E}[\mathcal{L}(\boldsymbol{W} - \eta\tilde{\boldsymbol{\Phi}}_B)] \approx \mathcal{L}(\boldsymbol{W}) - \eta \tr(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G}) + \frac{1}{2}\eta^2\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)] $$First, let’s calculate $\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]$:
$$ \mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]=\mathbb{E}[\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}]\approx\mathbb{E}[\tilde{\boldsymbol{G}}_B](\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B])^{-1/2} = \boldsymbol{G}(\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B])^{-1/2} $$Let’s write out $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]$ component-wise, assuming independence between different components:
$$ \mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]_{i,j} = \mathbb{E}\left[\sum_{k=1}^n (\tilde{G}_B)_{k,i}(\tilde{G}_B)_{k,j}\right] = \left\{\begin{aligned} \mathbb{E}\left[\sum_{k=1}^n (\tilde{G}_B)_{k,i}^2\right] = \left(\sum_{k=1}^n G_{k,i}^2\right) + n\sigma^2/B,\quad (i=j) \\[6pt] \sum_{k=1}^n \mathbb{E}[(\tilde{G}_B)_{k,i}] \mathbb{E}[(\tilde{G}_B)_{k,j}] = \sum_{k=1}^n G_{k,i}G_{k,j},\quad (i\neq j) \end{aligned}\right. $$Combining them, we get $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]=\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B) \boldsymbol{I}$, thus:
$$ \mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]\approx \boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B) \boldsymbol{I})^{-1/2} = \msign(\boldsymbol{G})(\boldsymbol{I} + (n\sigma^2/B) (\boldsymbol{G}^{\top}\boldsymbol{G})^{-1})^{-1/2} $$To further simplify the dependency on $B$, we approximate $\boldsymbol{G}^{\top}\boldsymbol{G}$ with $\tr(\boldsymbol{G}^{\top}\boldsymbol{G})\boldsymbol{I}/m$, meaning we only retain the diagonal part of $\boldsymbol{G}^{\top}\boldsymbol{G}$ and then replace the diagonal elements with their average. This way, we obtain:
$$ \mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]\approx \msign(\boldsymbol{G})(1 + \mathcal{B}_{\text{simple}}/B)^{-1/2} $$where $\mathcal{B}_{\text{simple}} = mn\sigma^2/\tr(\boldsymbol{G}^{\top}\boldsymbol{G})= mn\sigma^2/\Vert\boldsymbol{G}\Vert_F^2$. This is exactly the $\mathcal{B}_{\text{simple}}$ of the previous two articles, computed as if $\boldsymbol{G}$ were flattened into a vector. The above equation is identical to that of SignSGD, from which we can already guess that Muon will not yield many new results regarding the relationship between learning rate and batch size.
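To get a feel for how good this mean-field approximation is, the following sketch (my own illustration, with made-up values, not from the original article) draws noisy batch gradients with per-entry noise variance $\sigma^2/B$, averages $\msign(\tilde{\boldsymbol{G}}_B)$ over many draws, and compares against $\msign(\boldsymbol{G})/\sqrt{1+\mathcal{B}_{\text{simple}}/B}$; under the stated assumptions the two should roughly agree:

```python
import numpy as np

def msign(G):
    """msign via thin SVD: replace all singular values by 1."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
n, m, sigma, B = 64, 16, 2.0, 32
G = rng.normal(size=(n, m))              # illustrative "true" gradient

# Monte Carlo estimate of E[msign(G_B)], with per-entry noise variance sigma^2 / B
samples = [msign(G + rng.normal(scale=sigma / np.sqrt(B), size=(n, m)))
           for _ in range(2000)]
mc_mean = np.mean(samples, axis=0)

# Mean-field prediction: msign(G) / sqrt(1 + B_simple / B),
# with B_simple = m * n * sigma^2 / ||G||_F^2
B_simple = m * n * sigma**2 / np.sum(G**2)
prediction = msign(G) / np.sqrt(1 + B_simple / B)

print(np.linalg.norm(mc_mean - prediction) / np.linalg.norm(prediction))
```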
Same Pattern#
As for $\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]$, we only calculate it based on the assumption corresponding to Muon, which we just derived, i.e., $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$. Then:
$$ \tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B) = \eta_{\max}^{-1}\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}) $$Noting that $\tilde{\boldsymbol{\Phi}}_B$ is the output of $\msign$, its columns are orthonormal (assuming $\tilde{\boldsymbol{G}}_B$ has full column rank), so $\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B=\boldsymbol{I}$. In this case, $\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)$ is a definite constant $\eta_{\max}^{-1}\tr((\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2})=\eta_{\max}^{-1}\tr(\msign(\boldsymbol{G})^{\top}\boldsymbol{G})$. Minimizing the quadratic approximation of the expected loss with respect to $\eta$, we thus obtain the optimal learning rate:
$$ \eta^* \approx \frac{\tr(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G})}{\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]}\approx \frac{\eta_{\max}}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}} $$As expected, the form is exactly the same as the SignSGD result, with no new patterns.
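To make the resulting scaling law concrete, here is a tiny illustrative snippet (the values of $\eta_{\max}$ and $\mathcal{B}_{\text{simple}}$ are made up, not from the article) evaluating $\eta^*(B) = \eta_{\max}/\sqrt{1+\mathcal{B}_{\text{simple}}/B}$: for $B\ll\mathcal{B}_{\text{simple}}$ it grows like $\sqrt{B}$, and for $B\gg\mathcal{B}_{\text{simple}}$ it saturates at $\eta_{\max}$.

```python
import numpy as np

# Mean-field prediction for Muon (same form as SignSGD):
#   eta_star(B) = eta_max / sqrt(1 + B_simple / B)
eta_max, B_simple = 1e-3, 4096        # illustrative values, not from the article

for B in [64, 256, 1024, 4096, 16384, 65536]:
    eta_star = eta_max / np.sqrt(1 + B_simple / B)
    print(f"B = {B:6d}   eta* = {eta_star:.2e}   eta*/eta_max = {eta_star/eta_max:.3f}")
# For B << B_simple the curve behaves like eta_max * sqrt(B / B_simple) (square-root scaling);
# for B >> B_simple it saturates at eta_max.
```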
In fact, a closer look reveals that this is reasonable, because SignSGD directly applies $\sign$ to the gradient, while Muon’s $\msign$ applies $\sign$ to singular values. Intuitively, this is equivalent to applying $\sign$ in a different coordinate system. It introduces a new matrix update rule, but the learning rate $\eta^*$ and batch size $B$ are just scalars. Given that the core operation behind both is $\sign$, the asymptotic relationship between these scalars is highly unlikely to show significant changes.
Of course, we have only calculated a special $\boldsymbol{H}$ so far. If a more general $\boldsymbol{H}$ is considered, it is possible that the “Surge” phenomenon—where increasing batch size leads to a decrease in learning rate—might also appear, similar to SignSGD. However, as we stated in the “Reflections on the Causes” section of the previous article, if the Surge phenomenon is indeed observed, it might be more appropriate to change the optimizer rather than simply adjusting the relationship between $\eta^*$ and $B$.
Summary#
In this article, we attempted a simple analysis of Muon using the mean field approximation. The conclusion is that its relationship between learning rate and batch size is consistent with SignSGD, presenting no new patterns.
@online{kexuefm-11285,
title={Rethinking Learning Rate and Batch Size (III) - Muon},
author={苏剑林},
year={2025},
month={09},
url={\url{https://kexue.fm/archives/11285}},
}