Recently, I believe many readers have come across news about the Muon optimizer. Muon was in fact first proposed by Keller Jordan on Twitter around October last year, just over a year ago. Yet within this single year, Muon has already been tested in training models at tens-of-billions, hundreds-of-billions, and even trillion-parameter scales, which is enough to show that it is a very competitive optimizer.
Nowadays, Muon is built into training frameworks such as Torch and Keras, and even large frameworks like Megatron are gradually starting to support it, which means it has gained widespread industry recognition. However, for readers familiar only with Adam, how to switch to Muon quickly and effectively may still be unclear. This article therefore attempts to provide a quick start guide.
Brief Introduction#
The formal proposer of Muon is Keller Jordan, currently at OpenAI. As mentioned at the beginning, Muon was first published on Twitter, and to this day its author has written only a blog post, “Muon: An optimizer for hidden layers in neural networks”, rather than a paper. In his view, “whether or not it’s written as a paper has no bearing on whether or not an optimizer is effective”.
Muon is an optimizer specifically designed for matrix parameters. There are also some related works with similar characteristics, such as Shampoo, and earlier ones like Stochastic Spectral Descent, among others. Many works can be more or less associated with Muon, but none fully encompass Muon, so in my opinion, Muon is a completely new work.
In China, the earliest article to popularize Muon was likely my blog post “Appreciation of Muon Optimizer: An Essential Leap from Vectors to Matrices”. The first large-scale model to validate Muon was probably our Moonlight, released in February, which proposed the Moonlight version of Muon, later used in the trillion-parameter K2. After K2, GLM-4.5 also used this Muon variant.
To me, as Jeremy Bernstein, one of Muon’s authors, also notes in his blog post “Deriving Muon”, Muon’s uniqueness lies in being derived from more fundamental optimization principles while proving effective in practice. Adam, by contrast, is also effective but resembles more of a heuristic.
Four Versions#
This article does not intend to delve into the mathematical details of Muon or its implementation, but rather to primarily introduce some technical details and considerations when switching from Adam to Muon. As mentioned earlier, Muon is specifically designed for optimizing matrix parameters, and its update rule is non-element-wise, which might be confusing for new users.
Furthermore, to my knowledge, Muon currently has at least four slightly different versions, and this multi-version phenomenon contributes to the confusion. If users do not understand these details, they might get poor results due to incorrect hyperparameter tuning (especially the learning rate). The following will clarify these points. First, for a matrix $\boldsymbol{W}\in\mathbb{R}^{d_{in}\times d_{out}}$, where $\boldsymbol{G}$ is its gradient, the four Muon variants are:
$$
\begin{align}
\boldsymbol{M}_t &= \beta \boldsymbol{M}_{t-1} + \boldsymbol{G}_t \\[7pt]
\boldsymbol{W}_t &= \boldsymbol{W}_{t-1} - \eta_t \left(\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{Naive Version})} \\[5pt]
\boldsymbol{W}_t &= \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{\max(1, d_{out}/d_{in})}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{KellerJordan Version})} \\[5pt]
\boldsymbol{W}_t &= \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{d_{out}/d_{in}}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{MuP Version})} \\[5pt]
\boldsymbol{W}_t &= \boldsymbol{W}_{t-1} - \eta_t \left(0.2\times\sqrt{\max(d_{out},d_{in})}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{Moonlight Version})}
\end{align}
$$

To enable Nesterov momentum, replace $\msign(\boldsymbol{M}_t)$ with $\msign(\beta\boldsymbol{M}_t + \boldsymbol{G}_t)$. In implementations, $\msign$ is usually named zeropower_via_newtonschulz; typical users can disregard the specific implementation details.
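For the curious, below is a minimal sketch of the quintic Newton-Schulz iteration behind zeropower_via_newtonschulz. The coefficients follow Keller Jordan's public implementation; the normalization and orientation handling here are a simplified rendition, not a definitive one.

```python
import torch

@torch.no_grad()
def zeropower_via_newtonschulz(M, steps=5, eps=1e-7):
    # Approximates msign(M) = U V^T (from the reduced SVD M = U S V^T)
    # via a quintic Newton-Schulz iteration. Coefficients follow
    # Keller Jordan's public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)       # normalize so all singular values are <= 1
    tall = M.shape[0] > M.shape[1]
    if tall:                       # iterate on the wide orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```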
The only difference among the four versions is the scaling factor in front of $\msign$. The “KellerJordan Version” and “MuP Version” are largely similar, while the “Moonlight Version” is a bit more distinctive. Keras implements only the “KellerJordan Version”, while Torch implements both the “KellerJordan Version” and the “Moonlight Version”; the Naive Version seems to be less common nowadays. As for myself, I mostly use my own “MuP Version”.
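As a compact reference, the four scaling factors transcribe into code as follows (the helper name and version labels are my own):

```python
import math

def muon_scale(d_in, d_out, version):
    # Scaling factor in front of msign for each variant listed above.
    if version == "naive":
        return 1.0
    if version == "kellerjordan":
        return math.sqrt(max(1.0, d_out / d_in))
    if version == "mup":
        return math.sqrt(d_out / d_in)
    if version == "moonlight":
        return 0.2 * math.sqrt(max(d_in, d_out))
    raise ValueError(f"unknown version: {version}")
```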
Two Dimensions#
Here we need to pay attention to an important detail: the “KellerJordan Version” and “MuP Version” are sensitive to the order of $d_{in}$ and $d_{out}$. The first step, then, is to understand what $d_{in}$ and $d_{out}$ mean; it is not necessarily the case that the first dimension of a matrix is $d_{in}$ and the second is $d_{out}$.
The meanings of $d_{in}$ and $d_{out}$ are the input and output dimensions of a linear layer, respectively. So, which one is $d_{in}$ and which is $d_{out}$ depends on the specific implementation of the linear layer. For example, Keras’s Dense layer is implemented as $\boldsymbol{x}\boldsymbol{W}$, so the first dimension of matrix $\boldsymbol{W}$ is $d_{in}$ and the second dimension is $d_{out}$. However, Torch’s Linear layer is implemented as $\boldsymbol{x}\boldsymbol{W}^{\top}$, so the second dimension of matrix $\boldsymbol{W}$ is $d_{in}$ and the first dimension is $d_{out}$.
Therefore, to implement the “KellerJordan Version” of Muon, for Torch’s Linear layer, the scaling factor should be max(1, W.shape[0]/W.shape[1])**0.5, while for Keras, it should be max(1, W.shape[1]/W.shape[0])**0.5. Thus, Keras’s current Muon implementation is actually incorrect because it directly copied Torch’s scaling factor implementation (source code).
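To make the orientation dependence concrete, here is an illustrative helper; the function and its layout argument are hypothetical labels, not part of any real API:

```python
import math

def kellerjordan_scale(W, layout):
    # sqrt(max(1, d_out/d_in)), with d_in/d_out read off according to
    # the framework's linear-layer convention.
    if layout == "torch":    # nn.Linear computes x @ W.T, so W is (d_out, d_in)
        d_out, d_in = W.shape
    elif layout == "keras":  # Dense computes x @ W, so W is (d_in, d_out)
        d_in, d_out = W.shape
    else:
        raise ValueError(layout)
    return math.sqrt(max(1.0, d_out / d_in))
```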
If you write your own model, you need to judge carefully based on your coding style. For instance, it’s not impossible to mix Torch’s built-in Linear layer with custom x @ W implementations, in which case you cannot generalize whether it’s W.shape[0]/W.shape[1] or W.shape[1]/W.shape[0]. Of course, if you find it troublesome to figure these out, you can consider using the “Moonlight Version,” whose scaling factor is symmetric with respect to $d_{in}, d_{out}$.
Hyperparameter Settings#
After understanding $d_{in}, d_{out}$, the remaining task is to determine how to set the learning rate $\eta_t$ and weight decay coefficient $\lambda$. The assumption here is that users already have experience tuning Adam, have achieved good results with Adam, and wish to quickly migrate to Muon for a try.
Let’s first look at the “Moonlight Version.” Its scaling factor is derived by aligning with Adam’s Update RMS. For details, you can refer to “Muon Sequel: Why Did We Choose to Try Muon?”. As for the “Magic Number” $0.2$, you can refer to “Why is Adam’s Update RMS 0.2?”. Simply put, the “Moonlight Version” of Muon aligns with Adam’s update magnitude, so the simplest way to migrate from Adam is: no changes needed, just reuse Adam’s $\eta_t$ and $\lambda$.
Next, let’s look at the remaining three versions. We know that mainstream models typically have a hidden_size (denoted $d$), and the matrix shapes in a model usually do not deviate far from $d\times d$, so we approximate by setting $d_{in}=d_{out}=d$. In this case, the three versions are identical, and compared to the “Moonlight Version” they lack the factor $0.2\sqrt{d}$. Since the “Moonlight Version” aligns with Adam’s update magnitude without requiring hyperparameter changes, the learning rates of these three versions should be scaled up by $0.2\sqrt{d}$ to match Adam’s update magnitude; correspondingly, $\lambda$ should be divided by $0.2\sqrt{d}$.
Substituting $d=1024, 2048, 4096$ gives roughly $6.4$, $9.1$, and $12.8$ respectively. If you can’t remember $0.2\sqrt{d}$, simply remember that when using the other three versions of Muon, you can scale Adam’s learning rate up by about 10x to serve as Muon’s learning rate. If Adam’s learning rate is applied to Muon unchanged, Muon will appear to perform much worse than Adam due to underfitting; to my knowledge, some negative reviews of Muon stem from exactly this.
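As a rule-of-thumb sketch of this migration (the helper name and the example Adam hyperparameters are mine):

```python
import math

def adam_to_muon(adam_lr, adam_wd, hidden_size):
    # For the naive/KellerJordan/MuP versions: scale the learning rate up
    # by 0.2*sqrt(d) and divide the weight decay by the same factor, so
    # the update magnitude roughly matches Adam's (see text above).
    factor = 0.2 * math.sqrt(hidden_size)
    return adam_lr * factor, adam_wd / factor

# Example: adam_to_muon(3e-4, 0.1, 4096) -> (3.84e-3, 0.0078125)
```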
So, does this mean the “Moonlight Version” is better to use? The “Moonlight Version” indeed yields good practical results, but to say it’s simply “better” is to evaluate it from Adam’s perspective. The advantage of the “MuP Version” or “KellerJordan Version” is that the learning rate is transferable; that is, after tuning the learning rate on a small model, it often works well when directly applied to a large model. For more on this, you can refer to Jeremy Bernstein’s blog post “Deriving Muon” or my blog post “Advanced MuP: Simpler yet More Sophisticated Spectral Conditional Scaling”.
Other Parameters#
If Muon only handles matrix parameters, what about other parameters? For example, the bias term of linear layers, the gamma term of RMSNorm, which are 1-dimensional parameters; or convolutional layers which might have 3-dimensional or 4-dimensional array parameters.
First, a correction: Muon doesn’t only handle matrix parameters; it specifically handles “matrix parameters of linear layers with dense inputs.” If this is too confusing, just remember that matrix parameters for Embedding layers and the final classification layers (including GPT’s LM Head) should not use Muon, as the performance would be significantly worse. For these matrix parameters that cannot use Muon, as well as 1-dimensional, 3-dimensional, and higher-dimensional parameters, if readers don’t want to spend too much effort, they can simply use Adam. Most Muon implementations are mixed with Adam, allowing users to optionally use Adam for certain layers.
If readers are willing to experiment, then 3D and 4D parameters, such as those of convolutional layers, can also use Muon. Take Conv2D as an example: the convolutional kernel typically has shape $(w, h, d_{in}, d_{out})$. An equivalent implementation flattens each $(w, h, d_{in})$ input patch into a vector of size $w\times h\times d_{in}$, reshapes the kernel to $(w\times h\times d_{in},\, d_{out})$, and then performs a matrix multiplication. Therefore, to use Muon, the momentum must first be reshaped to $(w\times h\times d_{in},\, d_{out})$, $\msign$ computed on it, and the result reshaped back for the update.
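Sketched under the $(w, h, d_{in}, d_{out})$ kernel layout above (the helper is illustrative and reuses the zeropower_via_newtonschulz sketch from earlier):

```python
def conv_kernel_msign(momentum):
    # momentum has shape (w, h, d_in, d_out); flatten the patch axes into
    # the input dimension, apply msign, then restore the original shape.
    w, h, d_in, d_out = momentum.shape
    flat = momentum.reshape(w * h * d_in, d_out)
    update = zeropower_via_newtonschulz(flat)
    return update.reshape(w, h, d_in, d_out)
```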
Similarly, RMSNorm’s gamma parameter can be viewed as multiplication by a diagonal matrix, so its momentum can be treated as a diagonal matrix when computing $\msign$, which reduces to SignSGDM. An Embedding layer can be seen as a stack of $(1,d)$ matrices for the $\msign$ calculation, which reduces to Normalized SGDM (refer to “Appreciation of Muon Optimizer: An Essential Leap from Vectors to Matrices”). If you want to go further: for Multi-Head Attention, could the projection matrix of each head be given its own $\msign$ calculation…?
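For illustration, these two special cases reduce to one-liners (a sketch assuming Torch tensors; the function names are mine):

```python
import torch

def gamma_update(gamma_momentum):
    # msign of a diagonal matrix is the sign of its diagonal,
    # so Muon degenerates to SignSGDM for RMSNorm's gamma.
    return torch.sign(gamma_momentum)

def embedding_update(emb_momentum, eps=1e-7):
    # Treating each embedding row as a (1, d) matrix, msign normalizes
    # the row, so Muon degenerates to Normalized SGDM.
    return emb_momentum / (emb_momentum.norm(dim=-1, keepdim=True) + eps)
```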
Never stop experimenting~
Summary#
Finally, if users have correctly set up and run their models according to the instructions above, they can then start praying for good fortune.
What kind of results should we expect? If no anomalies like gradient explosion occur, then in most cases, Muon will be slightly better than Adam. Of course, there might be some situations where Muon performs slightly worse, but in any case, their difference will not be very large. If one performs significantly better than the other, then you might need to reconsider whether there’s an issue with the settings on either side.
However, none of this is absolute. For instance, under certain extreme settings, it’s indeed possible for Muon to perform much better than Adam, with Adam failing to improve no matter how it’s tuned. In short, good luck to you.
@online{kexuefm-11416,
title={Muon Optimizer Guide - Quick Start and Key Details},
author={苏剑林},
year={2025},
month={11},
url={\url{https://kexue.fm/archives/11416}},
}