
Why is Adam's Update RMS 0.2?


This is a gemini-2.5-flash translation of a Chinese article.

It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.

Su Jianlin | 2025-09-02

It is well-known that we started experimenting with Muon for training large-scale LLMs quite early on. Specifically, in “Muon Sequel: Why We Chose to Experiment with Muon?”, we proposed the “Match Adam Update RMS” trick to facilitate a quick migration from Adam to Muon, a trick that was also used in the training of Kimi K2. This trick involves unifying Muon’s Update RMS to 0.2, which allows us to reuse Adam’s learning rate and weight decay rate.

Behind this trick is our observation that Adam’s Update RMS is approximately 0.2, and this phenomenon is stable and reproducible. This raises an interesting question: Why is Adam’s Update RMS 0.2? Can we explain it theoretically?

Problem Introduction

First, let’s describe the phenomenon: from experiments, we observed that, roughly after Warmup ends and the model enters formal training, Adam’s Update RMS almost always stays between 0.2 and 0.3, and models of different sizes exhibit similar patterns. The commonality among these models is that they are all trained with Adam, using parameters $\beta_1=0.9, \beta_2=0.95$. Since the commonality is very obvious, this is likely not a coincidence, so I will try to analyze the underlying principle.

Next, let’s review the form of the Adam optimizer:

$$ \text{Adam}\color{skyblue}{\text{W}}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{u}_t =\hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t \color{skyblue}{ + \lambda_t \boldsymbol{\theta}_{t-1}}) \end{aligned}\right. $$

Note: All vector multiplications and divisions in this paper, including squaring, refer to Hadamard product/quotient by default.
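As a concrete reference, here is a minimal NumPy sketch of a single AdamW step mirroring the update rule above (the function name and default arguments are illustrative choices, not from the original article):

import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update, element-wise, following the equations above (sketch)."""
    m = beta1 * m + (1 - beta1) * g           # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2        # second-moment EMA
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)        # the update whose RMS we study
    theta = theta - lr * (u + weight_decay * theta)
    return theta, m, v, u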

What we need to do is prove that $\Vert\boldsymbol{u}_t\Vert_{RMS}\approx 0.2$, at least under the setting of $\beta_1=0.9, \beta_2=0.95$. Since we are concerned with the situation after stable training, we can assume that $t$ is large enough such that $\beta_1^t$ and $\beta_2^t$ are sufficiently close to 0. In this case, there’s no need to distinguish between $\boldsymbol{m}_t$ and $\hat{\boldsymbol{m}}_t$, or between $\boldsymbol{v}_t$ and $\hat{\boldsymbol{v}}_t$. At the same time, we assume that $\epsilon$ is small enough to be ignored, so we have $\boldsymbol{u}_t =\boldsymbol{m}_t/\sqrt{\boldsymbol{v}_t}$.

For $\boldsymbol{m}_t,\boldsymbol{v}_t$, we can obtain the expanded forms:

$$ \boldsymbol{m}_t = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\boldsymbol{g}_i,\qquad \boldsymbol{v}_t = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\boldsymbol{g}_i^2 $$
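As a quick sanity check (my own addition, not in the original article), the recursive and expanded forms can be verified to agree numerically:

import numpy as np

np.random.seed(0)
beta1, T = 0.9, 50
g = np.random.randn(T)

m_rec = 0.0
for t in range(T):                    # recursive EMA: m_t = beta1*m_{t-1} + (1-beta1)*g_t
    m_rec = beta1 * m_rec + (1 - beta1) * g[t]

# expanded form: (1 - beta1) * sum_i beta1**(t-i) * g_i (0-indexed here)
m_exp = (1 - beta1) * sum(beta1**(T - 1 - i) * g[i] for i in range(T))
print(np.isclose(m_rec, m_exp))       # True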

Numerical Simulation

If we assume that $\boldsymbol{g}_1, \boldsymbol{g}_2, \cdots, \boldsymbol{g}_t$ are all sampled from the same distribution, then we can directly estimate $\Vert\boldsymbol{u}_t\Vert_{RMS}$ using numerical simulation. Without further ado, let’s try with the simplest standard normal distribution $\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$, with the reference code as follows:

import numpy as np

N, T = 10000, 2000           # number of components and number of steps
beta1, beta2 = 0.9, 0.95     # the beta1, beta2 used in the training setting above
m, v = 0, 0
for i in range(T):
    g = np.random.randn(N)               # i.i.d. "gradient" sampled from N(0, I)
    m = beta1 * m + (1 - beta1) * g      # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2   # second-moment EMA

u = m / v**0.5               # Adam update (bias correction and epsilon ignored)
rms = (u**2).mean()**0.5     # Update RMS
print(rms)

Guess what the result is? The answer is approximately 0.225, which is surprisingly close to the experimental results! This, in turn, suggests that our simulation assumptions are quite consistent with the real situation. Some readers might object: $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ is pure noise, so how can it be consistent with actual training, which of course is not pure noise? The match merely suggests that the signal-to-noise ratio of a single gradient is extremely small, so pure noise serves as an adequate stand-in for simulation.

Readers can tinker with the reference code above to observe which variables influence the Update RMS. The general conclusion: the Update RMS decreases as $\beta_1$ increases, seems to have little relation with $\beta_2$, and if the distribution of $\boldsymbol{g}$ has a non-zero mean (equivalent to increasing the gradient's signal-to-noise ratio), the Update RMS also increases.
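A hypothetical sweep along these lines (the helper name and parameter grids below are my own choices):

import numpy as np

def update_rms(beta1, beta2, mu=0.0, N=10000, T=2000):
    """Simulated Update RMS for i.i.d. gradients g ~ N(mu, 1) per component."""
    m, v = 0, 0
    for _ in range(T):
        g = mu + np.random.randn(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    return ((m / v**0.5)**2).mean()**0.5

for beta1 in [0.8, 0.9, 0.95]:      # RMS shrinks as beta1 grows
    print('beta1 =', beta1, update_rms(beta1, 0.95))
for beta2 in [0.9, 0.95, 0.99]:     # nearly no effect from beta2
    print('beta2 =', beta2, update_rms(0.9, beta2))
print('mu = 0.3:', update_rms(0.9, 0.95, mu=0.3))   # non-zero mean raises the RMS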

Mean-Field Approximation

In this section, I will attempt to derive an approximate analytical solution for the simulation results from a theoretical perspective. First, from the definition of RMS, to find $\Vert\boldsymbol{u}_t\Vert_{RMS}$, we need to first calculate $\boldsymbol{u}_t^2 = \boldsymbol{m}_t^2/\boldsymbol{v}_t$. My idea is to use the expectation of $\boldsymbol{u}_t^2$ as its approximation, and further transform it into a mean-field approximation:

$$ \mathbb{E}[\boldsymbol{u}_t^2] = \mathbb{E}\left[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\right] \approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]} $$

Some readers might question how reasonable the last approximation step is. My suggestion is to set such details aside for now, just as we assumed $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ in the previous section: let's compute first, and if the result turns out reasonable, the process must be reasonable to some extent. Now we calculate the numerator and denominator separately, this time in the general setting $\mathbb{E}[\boldsymbol{g}]=\boldsymbol{\mu}$, $\mathbb{E}[\boldsymbol{g}^2]=\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. The denominator is relatively simple:

$$ \begin{aligned} \mathbb{E}[\boldsymbol{v}_t] =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\mathbb{E}[\boldsymbol{g}_i^2] \\ =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[6pt] =&\, (1 - \beta_2^t)(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[8pt] =&\, \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2\qquad (t\to\infty) \end{aligned} $$

As for the numerator, we could expand the square directly, but it is easier to note that the second moment $\mathbb{E}[\boldsymbol{m}_t^2]$ equals $\mathbb{E}[\boldsymbol{m}_t]^2 + \mathbb{V}ar[\boldsymbol{m}_t]$. Since $\boldsymbol{m}_t$ is a weighted average of the $\boldsymbol{g}_i$ (with weights summing to 1 as $t\to\infty$), we have $\mathbb{E}[\boldsymbol{m}_t]=\mathbb{E}[\boldsymbol{g}_i]=\boldsymbol{\mu}$. As for the variance, since the $\boldsymbol{g}_i$ are independent, their variances add with squared weights, hence:

$$ \mathbb{V}ar[\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{i=1}^t \beta_1^{2(t-i)}\boldsymbol{\sigma}^2 = \frac{(1 - \beta_1)^2 (1 - \beta_1^{2t})}{1 - \beta_1^2}\boldsymbol{\sigma}^2= \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2\qquad (t\to\infty) $$

Therefore:

$$ \mathbb{E}[\boldsymbol{u}_t^2]\approx \frac{\boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2}{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2} $$

Result Analysis

Since $\mathbb{E}[\boldsymbol{u}_t^2]$ is already a squared vector, to estimate $\Vert\boldsymbol{u}_t\Vert_{RMS}$, we only need to average its components and then take the square root. For the averaging step, we might as well perform another mean-field approximation (averaging the numerator and denominator separately), which will finally yield:

$$ \Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}} = \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}}{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + 1}} $$

It has two influencing factors: one is $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$, which can be seen as the signal-to-noise ratio (SNR) of the gradient; the other is $\beta_1$, which is one of Adam’s hyperparameters. Notably, the result does not depend on $\beta_2$, which is consistent with the previous simulation results. So, how good is this approximation? Let’s consider the simplest special case, $\boldsymbol{\mu}=\boldsymbol{0}$, where:

$$ \Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{1 - \beta_1}{1 + \beta_1}} $$
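Plugging in $\beta_1=0.9$ gives a concrete number:

$$ \Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{1 - 0.9}{1 + 0.9}} = \sqrt{\frac{0.1}{1.9}} \approx 0.229 $$

which is very close to the 0.225 obtained from the numerical simulation above.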

Compared with the simulation results, it is as follows:

[Figure: simulation results vs. mean-field approximation for different $\beta_1$ and $\beta_2$]

It should be said that the approximation is quite good; in particular, once $\beta_2 \geq 0.9$, the simulation results almost perfectly match the mean-field approximation. As for the comparison that takes the SNR into account, the results are as follows:

[Figure: simulation results vs. mean-field approximation for different $\beta_1$ and SNR]

When the signal-to-noise ratio increases, the error of the mean-field approximation starts to grow, but it still captures the overall trend. In practice, the gradient's signal-to-noise ratio rarely gets anywhere near 1, so the mean-field approximation can still be considered good.
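A minimal sketch of this comparison (the SNR grid and function names are my own choices, reusing the earlier i.i.d.-gradient assumption with unit variance per component, so the per-component mean is $\sqrt{\text{SNR}}$):

import numpy as np

def simulated_rms(beta1, beta2, snr, N=10000, T=2000):
    """Update RMS when each gradient component is N(mu, 1) with mu**2 = snr."""
    mu = snr**0.5
    m, v = 0, 0
    for _ in range(T):
        g = mu + np.random.randn(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    return ((m / v**0.5)**2).mean()**0.5

def meanfield_rms(beta1, snr):
    """The mean-field prediction derived above."""
    return ((snr + (1 - beta1) / (1 + beta1)) / (snr + 1))**0.5

for snr in [0.0, 0.1, 0.5, 1.0]:
    print(snr, simulated_rms(0.9, 0.95, snr), meanfield_rms(0.9, snr))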

Reverse Prediction

If we accept the mean-field approximation (the equation derived earlier), then we can use it in reverse to estimate the gradient’s signal-to-noise ratio:

$$ \frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{u}_t\Vert_{RMS}^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{u}_t\Vert_{RMS}^2} $$
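For example, taking $\beta_1=0.9$ and an illustrative measured value $\Vert\boldsymbol{u}_t\Vert_{RMS}=0.25$ (this number is hypothetical, not from the original article):

$$ \frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{0.25^2 - 0.1/1.9}{1 - 0.25^2} = \frac{0.0625 - 0.0526}{0.9375} \approx 0.01 $$

i.e., a single-step gradient SNR on the order of $10^{-2}$.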

In actual training, $\beta_1$ is given, and $\Vert\boldsymbol{u}_t\Vert_{RMS}$ (i.e., Adam’s Update RMS) can also be directly estimated, so the above equation is computable. Of course, this equation is only applicable to Adam. Is there a more general estimation approach? Indeed there is! Don’t forget what we estimated earlier:

$$ \mathbb{E}[\boldsymbol{m}_t^2]\approx \boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2 $$

Summing its components then gives an approximation of $\Vert\boldsymbol{m}_t\Vert^2$, and taking the square root yields:

$$ \Vert\boldsymbol{m}_t\Vert\approx \sqrt{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2} $$

As for the second moment, we have $\mathbb{E}[\boldsymbol{v}_t]\approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. Optimizers like Muon do not maintain a second moment, but note that this result is independent of $\beta_2$, so we may as well take the simplest special case $\beta_2=0$, in which $\boldsymbol{v}_t=\boldsymbol{g}_t^2$. This is admittedly a bit forced, but for estimation purposes we go with whatever is most convenient. The “approximation” then implies $\Vert\boldsymbol{g}_t\Vert^2\approx \Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2$. Thus, we have:

$$ \frac{\Vert\boldsymbol{m}_t\Vert}{\Vert\boldsymbol{g}_t\Vert}\approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}} $$

The form of the right-hand side is exactly the same as the equation derived earlier, so we can write:

$$ \frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2} $$

That is, using $\Vert\boldsymbol{m}_t\Vert/\Vert\boldsymbol{g}_t\Vert$ to replace $\Vert\boldsymbol{u}_t\Vert_{RMS}$. This provides a general approach for momentum optimizers to estimate $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$. Some readers might still wonder what to do if there’s no momentum. In that case, there really is no way, because $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ here is a statistic across optimization trajectories, and we always need some cross-trajectory statistical information to be able to estimate it.
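A minimal sketch of this estimator under the same i.i.d.-gradient assumptions as the earlier simulation (the true SNR value below is an illustrative choice of mine):

import numpy as np

beta1, N, T = 0.9, 10000, 2000
true_snr = 0.05
mu = true_snr**0.5                    # per-component mean, unit variance
m = 0
for _ in range(T):
    g = mu + np.random.randn(N)
    m = beta1 * m + (1 - beta1) * g   # momentum only; no second moment needed

r2 = (np.linalg.norm(m) / np.linalg.norm(g))**2   # ||m_t||^2 / ||g_t||^2
c = (1 - beta1) / (1 + beta1)
print('estimated SNR:', (r2 - c) / (1 - r2), 'true SNR:', true_snr)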

Summary

This article mainly explored Adam’s Update RMS from two perspectives: simulation experiments and theoretical approximation. It can serve as one of the theoretical bases for aligning the Update RMS to 0.2 in our Muon optimizer.

@online{kexuefm-11267,
        title={Why is Adam's Update RMS 0.2?},
        author={苏剑林},
        year={2025},
        month={09},
        url={\url{https://kexue.fm/archives/11267}},
}