Su Jianlin | 2025-10-21
I wonder if you’ve noticed an interesting detail: both Muon and MuP start with “Mu”, yet the two “Mu”s originally mean entirely different things. The former stands for “MomentUm Orthogonalized by Newton-Schulz”, while the latter is “Maximal Update Parametrization”. And yet there is a very deep connection between them: Muon and MuP started from completely different premises but ultimately converged on the same direction, even inadvertently acquiring similar names, as if it were truly “meant to be”.
Getting back to the main topic. In short, under various serendipitous circumstances, I happened to learn about both Muon and MuP concurrently. This greatly deepened my understanding of model optimization and prompted me to ponder the more fundamental principles behind it. After a period of trial and error, I’ve made some rudimentary discoveries, which I’d like to share with you here.
Foreword#
In chronological order of their proposal, MuP came before Muon. However, my learning sequence was the opposite: I first studied Muon and then MuP. In retrospect, this proved to be a rather good learning sequence.
In articles such as Appreciation of Muon Optimizer: An Essential Leap from Vector to Matrix and Muon Sequel: Why Did We Choose to Try Muon?, we described Muon as “steepest descent under spectral norm constraint”. The MuP series of works, in turn, perfectly explains “why spectral norm constraint is needed”, with the two naturally complementing each other.
Here, a special note: the term MuP, as we use it, has two meanings. First, there’s what’s introduced in A First Look at MuP: Scaling Laws for Hyperparameters Across Models, which is part of the Tensor Programs series of works, and which we refer to as ‘Elementary MuP’. Second, there’s what’s introduced in Higher-Order MuP: Simpler but More Sophisticated Spectral Condition Scaling, which we refer to as ‘Higher-Order MuP’. This latter work obtains richer conclusions than Elementary MuP in a more concise manner—both are the work of Greg Yang (hats off to the master).
Unless otherwise specified, MuP as mentioned in this article refers to ‘Higher-Order MuP’. In fact, this series of articles, which I call ‘Beyond MuP’, is a series of reflections and extensions based on Higher-Order MuP. However, for some readers, their understanding of MuP might be the ‘Elementary MuP’ from the Tensor Programs series, so at first glance, they might wonder how MuP can answer the question ‘why is a spectral norm needed?’
Regardless, I will try to make this series self-contained, so although we will mention many related papers or blog posts during the introduction, readers do not need to read all of them in depth.
Striving for Speed with Stability#
Back to the main topic once more. As the first article in this series, our task is to define the core objective, or more specifically, to clarify “what kind of model we truly want” and “how to train such a model”.
Intuitively, as long as the model shows no signs of collapse, we can continue training it until it converges to a satisfactory performance; building on this, we would then try to find methods to make the model converge faster. So, simply put, it’s about two things: “stability” and “speed”, or rather, “striving for speed with stability”. So, how do we determine if a model is stable? This naturally involves monitoring various “internal metrics”; the more we monitor, the more problems can be exposed.
However, here I don’t intend to list all kinds of internal metrics, but rather to identify the most core or necessary conditions. To that end, let’s first define a concept—RMS (Root Mean Square): Let $\boldsymbol{x}=(x_1,x_2,\cdots,x_d)\in\mathbb{R}^d$, then we define
$$ \begin{equation}\Vert\boldsymbol{x}\Vert_{RMS} = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2} = \frac{\Vert\boldsymbol{x}\Vert_2}{\sqrt{d}}\end{equation} $$It represents the average scale of each element, differing from the vector’s Euclidean norm $\Vert\boldsymbol{x}\Vert_2$ by a factor of $\sqrt{d}$.
Some readers might wonder, since it’s just a constant factor, why define a new concept instead of directly observing the Euclidean norm? There are several considerations: for instance, RMSNorm is often used, RMS is often easier to intuit than the norm, and critically, most activation functions are element-wise. Therefore, we need to examine and control the scale averaged across each element to ensure that activation functions perform similarly across different models.
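As a quick numerical check (a minimal NumPy sketch of my own; the helper name `rms` is not from the original), a vector with i.i.d. standard normal entries has RMS close to 1 at every width, whereas its Euclidean norm grows like $\sqrt{d}$, which is precisely why RMS is the easier quantity to intuit and compare across models of different sizes:

```python
import numpy as np

def rms(x):
    """Root Mean Square of a vector: ||x||_2 / sqrt(d)."""
    return np.sqrt(np.mean(np.square(x)))

for d in (64, 1024, 16384):
    x = np.random.randn(d)  # i.i.d. standard normal entries
    # RMS stays ~1 at every width; the Euclidean norm grows like sqrt(d)
    print(d, rms(x), np.linalg.norm(x))
```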
Three Conditions#
With the RMS notation, we can state what I believe are the three most necessary conditions for stably training a good model:
$$ \begin{align} &\text{Forward Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \\[5pt] &\text{Dependency Stability:}\quad\max_{\boldsymbol{x}_1,\boldsymbol{x}_2} \Vert \boldsymbol{f}(\boldsymbol{x}_1;\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x}_2;\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \\[5pt] &\text{Update Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega} + \Delta\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \end{align} $$Here, $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})$ represents a family of models from $\mathbb{R}^{d_{in}}$ to $\mathbb{R}^{d_{out}}$, with input $\boldsymbol{x}\in\mathbb{R}^{d_{in}}$, output $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\in\mathbb{R}^{d_{out}}$, and $\boldsymbol{\omega}$ being the model parameters (which can be scalars, vectors, matrices, etc.). $\mathcal{\Theta}$ is the “Big Theta Notation”. Here, $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})$ can be a single layer, a block composed of several layers, or even the entire model. Theoretically, the coarser the granularity, the looser, and in that sense the more accurate, the resulting constraints become, but also the harder the $\max$ is to evaluate; so the choice of granularity depends on our ability to compute the $\max$.
Among the three equations, equation (1) is probably the easiest to understand. It represents the stability of forward computation. After taking the $\max$ over $\boldsymbol{x}$, the only variable remaining is $\boldsymbol{\omega}$, so this is a constraint on $\boldsymbol{\omega}$. Please note that we have not restricted the range of $\boldsymbol{x}$ here, so by default $\boldsymbol{x}\in\mathbb{R}^{d_{in}}$. In this case, the maximum may not necessarily exist, for example, for a non-zero $\boldsymbol{W}$, we have $\max\limits_{\boldsymbol{x}}\Vert \boldsymbol{x}\boldsymbol{W}\Vert_{RMS}\to\infty$.
To ensure the existence of the maximum value, we usually need to add some Normalization operations, such as:
$$ \begin{align} &\text{Pre Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x})\boldsymbol{W} \\[5pt] &\text{Post Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x}\boldsymbol{W}) \end{align} $$where $\mathop{\text{RMSNorm}}(\boldsymbol{x})=\boldsymbol{x}/\Vert\boldsymbol{x}\Vert_{RMS}$. Thus, condition (1) also implicitly imposes some requirements on the model architecture. Similarly, equation (2) requires the model’s output to genuinely depend on its input. As a simple example, for $f(x;\omega)=x\times\omega\times 0 + 1$, this “model” is perfectly stable in its forward pass, but it has no dependency on $x$ whatsoever; equation (2) is not satisfied, so it is not a good model.
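The effect of the two Norm placements is easy to verify numerically. Below is a hedged sketch (NumPy; `rms_norm` implements the gain-free definition $\boldsymbol{x}/\Vert\boldsymbol{x}\Vert_{RMS}$ above, not any particular library’s RMSNorm):

```python
import numpy as np

rms = lambda v: np.sqrt(np.mean(np.square(v)))

def rms_norm(x, eps=1e-12):
    """Gain-free RMSNorm as defined above: x / ||x||_RMS."""
    return x / (rms(x) + eps)

d = 512
W = np.random.randn(d, d) / np.sqrt(d)  # common 1/sqrt(d_in) initialization
x = 100.0 * np.random.randn(d)          # deliberately oversized input

print(rms(rms_norm(x) @ W))  # Pre Norm:  Theta(1) despite the oversized input
print(rms(rms_norm(x @ W)))  # Post Norm: output RMS pinned to exactly 1
print(rms(x @ W))            # no Norm:   grows with the input, so the max blows up
```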
The last equation (3) should also be easy to understand. After taking the $\max$ over $\boldsymbol{x}$, the result is a constraint concerning $\boldsymbol{\omega}$ and $\Delta\boldsymbol{\omega}$. It primarily focuses on the impact of the increment $\Delta\boldsymbol{\omega}$, thus representing the expectation for training stability. We can use it to guide the hyperparameter settings of optimizers, and even to construct new optimizers based on it.
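To make the three conditions concrete, here is a rough numerical probe (my own sketch, not from the original). Since the true $\max$ over all inputs is generally intractable, it is replaced here by a handful of random probes at very different scales, giving only an empirical lower bound on each supremum:

```python
import numpy as np

rms = lambda v: np.sqrt(np.mean(np.square(v)))

def f(x, W):
    """Toy single-layer model in the Pre Norm layout: RMSNorm(x) @ W."""
    return (x / rms(x)) @ W

d = 512
W  = np.random.randn(d, d) / np.sqrt(d)          # init at Theta(1) forward scale
dW = 1e-2 * np.random.randn(d, d) / np.sqrt(d)   # a small, similarly scaled update

probes = [s * np.random.randn(d) for s in (0.01, 1.0, 100.0)]
print(max(rms(f(x, W)) for x in probes))                 # (1) forward stability
print(rms(f(probes[0], W) - f(probes[1], W)))            # (2) dependency on the input
print(max(rms(f(x, W + dW) - f(x, W)) for x in probes))  # (3) update stability
```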
Related Thoughts#
In summary, the three conditions, equations (1), (2), and (3), integrate considerations from model architecture, initialization, optimizer, and other aspects. It’s hard to argue that any one condition can be omitted, so I believe they are all essential. Of course, regarding these three conditions, there are some details worth discussing, such as the choice between $\max$ and $\mathbb{E}$.
In the current equations, we “eliminated” $\boldsymbol{x}$ by taking the $\max$, resulting in expressions dependent only on $\boldsymbol{\omega}$ and $\Delta\boldsymbol{\omega}$. Readers might have questions about this; for some, a more intuitive approach might be to compute the mathematical expectation $\mathbb{E}_{\boldsymbol{x}}$. Why $\max$ and not $\mathbb{E}$? There are several reasons. First, computing $\max$ only requires defining the domain of $\boldsymbol{x}$, whereas computing $\mathbb{E}$ requires defining the distribution of $\boldsymbol{x}$. Different distributions yield different results, and accurately defining this distribution is not a trivial matter.
Second, $\max$ has the advantage of being invariant to monotonic transformations, while $\mathbb{E}$ does not. For example, for $\max$, we have the identity $(\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS})^2 = \max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$, meaning that whether we take the $\max$ of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ or $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$, the essence is the same. However, this is not the case for $\mathbb{E}$: the expectation of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ and the expectation of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$ typically differ in computational difficulty, and their results do not necessarily have any direct relationship.
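A concrete one-line illustration of this asymmetry (my own example, not from the original): suppose $r(\boldsymbol{x}) = \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ takes only the two values $0$ and $2$, each with probability $1/2$. Then
$$\Big(\max_{\boldsymbol{x}} r\Big)^2 = 4 = \max_{\boldsymbol{x}} r^2,\qquad \text{but}\qquad \big(\mathbb{E}[r]\big)^2 = 1 \neq 2 = \mathbb{E}[r^2].$$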
Therefore, $\max$ is conceptually and procedurally simpler. A possible concern is whether $\max$ might be too strict, akin to a “sufficient but not necessary” condition. In fact, $\max$ here is just an intuitive shorthand; the mathematically precise notion is the supremum ($\sup$), the least upper bound, which is tight in the sense that it can be approached arbitrarily closely. In practice, the mean and the maximum are often of the same order of magnitude, and since our goal is merely $\mathcal{\Theta}(1)$, the difference is not significant. On the contrary, $\max$ accounts for extreme cases, maximally ensuring training stability, which is particularly important when training large models such as LLMs.
In fact, Elementary MuP, or the Tensor Programs series, conducts its analyses based on $\mathbb{E}$, whereas Higher-Order MuP, like this article, performs analyses based on $\max$. In retrospect, analyses based on $\mathbb{E}$ are inferior to Higher-Order MuP’s $\max$-based approach in terms of computational simplicity and generality of results. This, in turn, corroborates the effectiveness of $\max$.
Summary#
Starting with this article, I will share some top-down understandings of model optimization, which are extended reflections and expansions built upon the previous ‘Higher-Order MuP’. As the first article, we primarily described three basic conditions for model stability, or rather, three characteristics of a good model. These will serve as the cornerstone for subsequent computations and analyses.
@online{kexuefm-11340,
title={Beyond MuP: 1. Three Characteristics of a Good Model},
author={苏剑林},
year={2025},
month={10},
url={\url{https://kexue.fm/archives/11340}},
}