By Su Jianlin | 2025-09-22
As mentioned in “Rethinking Learning Rate and Batch Size (II): Mean Field”, one reason we focus on SignSGD is that it is often used as a theoretical stand-in for Adam, a common simplification when analyzing Adam theoretically. Beyond learning rate analysis, we also used this simplification in places such as “Configuring Different Learning Rates, Can LoRA Improve Further?” and “Initial Exploration of MuP: Scaling Laws for Hyperparameters Across Models”.
However, is SignSGD truly a good approximation of Adam? One notable difference is that SignSGD’s Update RMS is always 1, while Adam’s is not. The author found that the core reason for this difference is momentum, which is ubiquitous in optimizers like Adam, Lion, and Muon. Therefore, this article will examine the impact of momentum—or more broadly, EMA (Exponential Moving Average).
Problem Analysis#
From Adam’s perspective, SignSGD corresponds to the special case where $\beta_1=\beta_2=0$, or to Adam’s first update step (regardless of $\beta_1, \beta_2$). Therefore, we believe it shares some commonalities with Adam and can capture some general patterns.
However, there are also some notable differences between them. A typical one is the difference in Update RMS: SignSGD is always 1, while Adam is often significantly less than 1; additionally, Adam appears closer to SGD, resembling an intermediate version between SignSGD and SGD. Initially, the author thought this difference was due to $\epsilon$ in Adam’s denominator, which is why “How Does Adam’s Epsilon Affect the Learning Rate Scaling Law?” specifically calculated SoftSignSGD with $\epsilon$.
Later, in “Why is Adam’s Update RMS 0.2?”, we estimated Adam’s Update RMS from both simulation and theory. The mean-field approximation yielded an estimate of $\sqrt{\frac{1-\beta_1}{1+\beta_1}}$, which was verified to be consistent with both simulation results and actual experiments. This result explicitly depends on $\beta_1$, so it clearly guided our thoughts towards momentum.
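As a quick toy check of that estimate (not from the original article), one can feed Adam pure zero-mean noise gradients, the regime in which the Update RMS question is usually posed, and compare the resulting steady-state Update RMS with $\sqrt{\frac{1-\beta_1}{1+\beta_1}}$; the constants below are arbitrary illustrative choices.

```python
import numpy as np

# Toy check (not from the original article): feed Adam pure zero-mean noise
# gradients and compare the steady-state Update RMS with sqrt((1-b1)/(1+b1)).
rng = np.random.default_rng(0)
d, steps = 10_000, 2_000
beta1, beta2 = 0.9, 0.999

m = np.zeros(d)
v = np.zeros(d)
for _ in range(steps):
    g = rng.standard_normal(d)              # pure gradient noise, unit variance
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

update = (m / (1 - beta1**steps)) / np.sqrt(v / (1 - beta2**steps))  # eps -> 0
print("simulated Update RMS :", np.sqrt(np.mean(update**2)))
print("mean-field prediction:", np.sqrt((1 - beta1) / (1 + beta1)))
```

With $\beta_1=0.9$ the prediction is roughly $0.23$, in line with the empirical figure of about $0.2$ referenced above.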
This led to the analysis below. To state the conclusion up front: the role of $\epsilon$ is indeed secondary, and the real protagonist is momentum, i.e., the moving average of gradients. That brings us to the main subject of this article: EMA (Exponential Moving Average).
Gradient Descent#
To analyze the changes that EMA introduces, we start with SGDM, i.e., SGD with momentum; in practice, SGD is rarely used without momentum:
$$ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t \\ &\boldsymbol{w}_t = \boldsymbol{w}_{t-1} - \eta_t \boldsymbol{m}_t \end{aligned} $$In practical use, $\boldsymbol{g}_t$ is replaced by $\tilde{\boldsymbol{g}}_{B,t}$, which is a random variable with mean $\boldsymbol{g}_t$ and covariance matrix $\boldsymbol{\Sigma}_t/B$. These basic settings are the same as in “Rethinking Learning Rate and Batch Size (I): Status Quo”. The noise here is caused by randomly sampling different batches, so we can reasonably assume that $\tilde{\boldsymbol{g}}_{B,t}$ are mutually independent across different $t$.
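As a minimal sketch of this setup (the Gaussian form of the noise, the per-coordinate noise scale, and all names below are illustrative assumptions rather than part of the original article), one SGDM step under the minibatch-noise model might look like:

```python
import numpy as np

# Minimal sketch (illustrative assumptions): one SGDM step where the stochastic
# gradient has mean g_t and per-coordinate noise scale sigma_t / sqrt(B).
def sgdm_step(w, m, g_t, sigma_t, B, eta, beta1, rng):
    g_noisy = g_t + sigma_t / np.sqrt(B) * rng.standard_normal(g_t.shape)
    m = beta1 * m + (1 - beta1) * g_noisy   # EMA of noisy gradients (momentum)
    w = w - eta * m                         # parameter update
    return w, m
```

Independence of $\tilde{\boldsymbol{g}}_{B,t}$ across steps corresponds to drawing fresh noise at every call.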
Our task is to calculate
$$ \eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} $$The relevant derivations have been provided in previous articles and will not be repeated here. For SGDM, $\tilde{\boldsymbol{\varphi}}_B = \boldsymbol{m}_t$, which can be expanded as
$$ \boldsymbol{m}_t = (1 - \beta_1)\sum\limits_{s=1}^t \beta_1^{t-s}\tilde{\boldsymbol{g}}_{B,s} $$

Magnifying Batch Size#
Now we can calculate
$$ \mathbb{E}[\boldsymbol{m}_t] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\mathbb{E}[\tilde{\boldsymbol{g}}_{B,s}] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\boldsymbol{g}_s $$We further assume that gradients change slowly once the model training gets “on track,” so we can approximate $\boldsymbol{g}_s$ with the current gradient $\boldsymbol{g}_t$, yielding
$$ \mathbb{E}[\boldsymbol{m}_t] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\boldsymbol{g}_t = (1 - \beta_1^t) \boldsymbol{g}_t \approx \boldsymbol{g}_t \qquad (t\to\infty) $$As for $\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}]$, we use the identity $\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}] = \mathbb{E}[\boldsymbol{m}_t] \mathbb{E}[\boldsymbol{m}_t]^{\top} + \mathbb{C}\text{ov}[\boldsymbol{m}_t,\boldsymbol{m}_t]$, and then use the additivity of variance to get:
$$ \mathbb{C}\text{ov}[\boldsymbol{m}_t,\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{s=1}^t \beta_1^{2(t-s)}\boldsymbol{\Sigma}_s/B $$Similarly, assuming the covariance matrix changes slowly, we have
$$ \mathbb{C}\text{ov}[\boldsymbol{m}_t] \approx (1 - \beta_1)^2\sum_{s=1}^t \beta_1^{2(t-s)}\boldsymbol{\Sigma}_t/B = (1 - \beta_1)^2\frac{1-\beta_1^{2t}}{1-\beta_1^2}\boldsymbol{\Sigma}_t/B = \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\Sigma}_t/B \qquad (t\to\infty) $$Substituting into the equation for $\eta^*$ yields
$$ \eta^* \approx \frac{\eta_{\max}}{1 + \frac{1 - \beta_1}{1 + \beta_1}\mathcal{B}_{\text{noise}}/B},\qquad \eta_{\max} = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}},\quad\mathcal{B}_{\text{noise}} = \frac{\tr(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}} $$From this result, we see that introducing momentum is equivalent to magnifying SGD’s Batch Size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$. In the author’s understanding, momentum suppresses gradient noise at low cost by taking an EMA of gradients along the optimization trajectory, and this result matches that picture of what momentum does.
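As a sanity check on the $\frac{1-\beta_1}{1+\beta_1}$ factor (a toy Monte Carlo sketch under an assumed Gaussian noise model with a fixed true gradient, not the article’s experiments), one can verify that the steady-state variance of the momentum buffer is reduced exactly as if the Batch Size had been multiplied by $\frac{1+\beta_1}{1-\beta_1}$:

```python
import numpy as np

# Sketch (assumed toy setup): hold the true gradient fixed, feed SGDM noisy
# gradients with variance sigma^2 / B, and check that the momentum buffer's
# steady-state variance is about (1 - beta1) / (1 + beta1) * sigma^2 / B,
# i.e. momentum acts like multiplying B by (1 + beta1) / (1 - beta1).
rng = np.random.default_rng(0)
beta1, sigma, B, g_true = 0.9, 1.0, 8, 0.3
n_runs, steps = 20_000, 500          # independent chains, steps to steady state

m = np.zeros(n_runs)
for _ in range(steps):
    g_noisy = g_true + sigma / np.sqrt(B) * rng.standard_normal(n_runs)
    m = beta1 * m + (1 - beta1) * g_noisy

print("simulated Var[m] :", m.var())
print("predicted Var[m] :", (1 - beta1) / (1 + beta1) * sigma**2 / B)
```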
Signed Momentum#
Next, we consider SignSGDM, which can be viewed as a special case of Lion: SGDM with an additional $\sign$ operation applied to the update:
$$ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t \\ &\boldsymbol{w}_t = \boldsymbol{w}_{t-1} - \eta_t \sign(\boldsymbol{m}_t) \end{aligned} $$In practical training, $\boldsymbol{g}_t$ is also replaced by $\tilde{\boldsymbol{g}}_{B,t}$. For SignSGDM, $\tilde{\boldsymbol{\varphi}}_B = \sign(\boldsymbol{m}_t)$, so according to the mean-field approximation, we get:
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t}{\sqrt{\boldsymbol{m}_t^2}}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t]}{\sqrt{\mathbb{E}[\boldsymbol{m}_t^2]}} $$Here, vector multiplication is Hadamard product by default. We already calculated the numerator $\mathbb{E}[\boldsymbol{m}_t]$ in the previous section. The denominator $\mathbb{E}[\boldsymbol{m}_t^2]$ is actually equal to $\diag(\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}])$, so we can also substitute the result from the previous section to get:
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \approx \frac{\boldsymbol{g}_t}{\sqrt{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}} = \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \frac{1 - \beta_1}{1 + \beta_1}(\boldsymbol{\sigma}_t^2/\boldsymbol{g}_t^2)/B}} \approx \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \frac{1 - \beta_1}{1 + \beta_1} \mathcal{B}_{\text{simple}}/B}} $$where $\boldsymbol{\sigma}_t^2 = \diag(\boldsymbol{\Sigma}_t)$ and $\mathcal{B}_{\text{simple}} = \tr(\boldsymbol{\Sigma}_t)/\boldsymbol{g}_t^{\top}\boldsymbol{g}_t$. The above equation is equivalent to replacing $B$ in SignSGD with $\frac{1 + \beta_1}{1 - \beta_1}B$. If we further calculate $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we will find the same conclusion. Therefore, similar to SGDM, momentum is equivalent to magnifying SignSGD’s Batch Size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$.
In “Rethinking Learning Rate and Batch Size (III): Muon”, we calculated the learning rate rule for Muon and found it to be consistent with SignSGD. Thus, we can assert that the role of momentum in Muon is the same as in SignSGDM, approximately magnifying the Batch Size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$.
Double Smoothing#
Finally, let’s look at Adam:
$$ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right. \end{aligned} $$In practical training, $\boldsymbol{g}_t$ is again replaced by $\tilde{\boldsymbol{g}}_{B,t}$. We consider training that is already “on track,” i.e., $t\to\infty$, so we do not distinguish $\boldsymbol{m}_t$ from $\hat{\boldsymbol{m}}_t$, nor $\boldsymbol{v}_t$ from $\hat{\boldsymbol{v}}_t$. Since our focus is on the role of EMA, we also take $\epsilon\to 0$. For Adam, $\tilde{\boldsymbol{\varphi}}_B=\boldsymbol{m}_t/\sqrt{\boldsymbol{v}_t}$; the difference from SignSGDM is that the $\boldsymbol{m}_t^2$ in the denominator is replaced by a separate EMA statistic, $\boldsymbol{v}_t$.
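For reference, here is a minimal sketch of the quantities tracked in this analysis under the same minibatch-noise model, with bias correction dropped and $\epsilon\to 0$ as stated above; the function and argument names are hypothetical and only for illustration.

```python
import numpy as np

# Minimal sketch (illustrative assumptions): the Adam statistics used in the
# analysis, fed noisy gradients with mean g_t and noise scale sigma_t / sqrt(B).
def adam_direction(m, v, g_t, sigma_t, B, beta1, beta2, rng):
    g_noisy = g_t + sigma_t / np.sqrt(B) * rng.standard_normal(g_t.shape)
    m = beta1 * m + (1 - beta1) * g_noisy        # EMA of gradients
    v = beta2 * v + (1 - beta2) * g_noisy**2     # EMA of squared gradients
    phi = m / np.sqrt(v)                         # update direction, eps -> 0
    return m, v, phi
```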
From the mean-field approximation, we get:
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t}{\sqrt{\boldsymbol{v}_t}}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t]}{\sqrt{\mathbb{E}[\boldsymbol{v}_t]}} $$We have already calculated $\mathbb{E}[\boldsymbol{m}_t]$; we only need to calculate $\mathbb{E}[\boldsymbol{v}_t]$:
$$ \mathbb{E}[\boldsymbol{v}_t] = (1 - \beta_2)\sum_{s=1}^t \beta_2^{t-s}\mathbb{E}[\tilde{\boldsymbol{g}}_{B,s}^2] = (1 - \beta_2)\sum_{s=1}^t \beta_2^{t-s}(\boldsymbol{g}_s^2 + \boldsymbol{\sigma}_s^2/B)\approx \boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B $$As before, the last approximation assumes slow variation of gradients and variance, as well as $t\to\infty$. Thus we have
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \approx \frac{\boldsymbol{g}_t}{\sqrt{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B}} \approx \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}} $$This result is indeed the same as SignSGD, so from the perspective of the first moment alone, SignSGD is a reasonable approximation for Adam. However, we also have the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}]$. Under the assumption of independent components, we only need to calculate $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$:
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]} \approx \frac{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} $$

Two Special Cases#
Let’s examine two special cases. First, when $\beta_1=0$, the numerator and denominator are the same, and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$ is a vector of all ones, consistent with SignSGD. Therefore, SignSGD is a good approximation of Adam with $\beta_1=0$ (i.e., RMSProp). As $\beta_1$ increases, the approximation quality deteriorates.
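A toy one-coordinate simulation (arbitrary constants, not taken from the article) makes this concrete: it estimates $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$ for several values of $\beta_1$ and compares it with the formula above; $\beta_1=0$ lands near the SignSGD value of $1$, and the gap widens as $\beta_1$ grows.

```python
import numpy as np

# Sketch (toy, one coordinate): estimate Adam's E[phi^2] = E[m^2 / v] and compare
# with (g^2 + (1-b1)/(1+b1) * s2B) / (g^2 + s2B), where s2B = sigma^2 / B.
rng = np.random.default_rng(0)
g, sigma, B, beta2 = 0.2, 1.0, 4, 0.999
steps, n_runs = 2_000, 10_000
s2B = sigma**2 / B

for beta1 in (0.0, 0.5, 0.9, 0.99):
    m = np.zeros(n_runs)
    v = np.zeros(n_runs)
    for _ in range(steps):
        g_noisy = g + np.sqrt(s2B) * rng.standard_normal(n_runs)
        m = beta1 * m + (1 - beta1) * g_noisy
        v = beta2 * v + (1 - beta2) * g_noisy**2
    phi2 = (m / (1 - beta1**steps))**2 / (v / (1 - beta2**steps))  # bias-corrected
    pred = (g**2 + (1 - beta1) / (1 + beta1) * s2B) / (g**2 + s2B)
    print(f"beta1={beta1}: simulated {phi2.mean():.3f}, predicted {pred:.3f}")
```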
Second, in the limit $\beta_1\to 1$, we have
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] \approx \frac{\boldsymbol{g}_t^2}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B}\approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2 $$From this, we get $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}] \approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}$. Substituting into the equation for $\eta^*$ yields
$$ \eta^* \approx \frac{\Vert \boldsymbol{g}\Vert_1 \sqrt{1 + \mathcal{B}_{\text{simple}}/B}}{\sign(\boldsymbol{g})^{\top} \boldsymbol{H} \sign(\boldsymbol{g})} $$Note that this is a monotonically decreasing function of $B$, meaning that the learning rate should decrease as the Batch Size increases. From this, we can infer that an increase in Adam’s $\beta_1$ will accelerate the appearance of the “Surge phenomenon”.
This conclusion might seem a bit perplexing, but it becomes easier to understand from a different perspective. The “Surge phenomenon” refers to the optimal learning rate decreasing as the Batch Size increases beyond a certain threshold. The results for SGDM and SignSGDM both indicate that the introduction of momentum approximately magnifies the Batch Size by a factor of $\frac{1 + \beta_1}{1 - \beta_1} > 1$, which naturally increases the likelihood of exceeding this threshold.
In other words, the conclusion that “the ‘Surge phenomenon’ will appear more easily as $\beta_1$ increases” holds true even for SignSGDM. Adam has some new characteristics compared to SignSGDM, but the point that “the momentum mechanism is approximately equivalent to magnifying the Batch Size” always holds, so it’s not difficult to understand why the same conclusion emerges.
General Analysis#
Let’s rewrite the equation for $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$:
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] \approx \frac{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} = \frac{2\beta_1}{1+\beta_1}\frac{\boldsymbol{g}_t^2}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} + \frac{1 - \beta_1}{1 + \beta_1} \approx \frac{2\beta_1}{1+\beta_1}\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2 + \frac{1 - \beta_1}{1 + \beta_1} $$From this, we can write
$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}] \approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top} + \frac{1 - \beta_1}{1 + \beta_1}\diag\left(1 - \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2\right) $$Then
$$ \eta^* \approx \frac{\sum_i |g_i|}{\frac{1}{\beta}\frac{1 - \beta_1}{1 + \beta_1}\sum_i H_{i,i} + \beta\left(\sum_{i,j} H_{i,j}\sign(g_i g_j) - \frac{1 - \beta_1}{1 + \beta_1}\sum_i H_{i,i}\right)} $$Here, the $\beta$ without a subscript equals $(1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$; it is easily confused with $\beta_1, \beta_2$, for which the author apologizes, as this notation is carried over from the previous two articles. Unlike SignSGD, where assuming a diagonal Hessian rules out the Surge phenomenon, the expression above exhibits the Surge phenomenon even under the diagonal-Hessian assumption. In that case:
$$ \eta^* \approx \frac{\sum_i |g_i|}{\left(\frac{1}{\beta}\frac{1 - \beta_1}{1 + \beta_1} + \beta\frac{2\beta_1}{1 + \beta_1}\right)\sum_i H_{i,i}} $$By the AM–GM inequality, the denominator is minimized, and hence the expression above maximized, at $\beta^*=\sqrt{\frac{1-\beta_1}{2\beta_1}}$. Note, however, that by definition $\beta \in(0,1)$, so we must also check whether $\beta^*\in(0,1)$, which requires $\beta_1 > 1/3$. If this condition fails, the maximum is only approached as $\beta\to 1$ (i.e., $B\to\infty$), and there is no Surge phenomenon. Conversely, when $\beta_1 > 1/3$ and $\beta > \beta^*$ (i.e., $B > \frac{1-\beta_1}{3\beta_1-1}\mathcal{B}_{\text{simple}}$), the optimal learning rate decreases as the Batch Size increases.
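To make the threshold concrete, here is a small numeric sketch (illustration only: the overall factor $\sum_i |g_i|/\sum_i H_{i,i}$ is dropped and $\mathcal{B}_{\text{simple}}$ is set to $1$) that scans $B$, locates the peak of the diagonal-Hessian expression above, and compares it with $\frac{1-\beta_1}{3\beta_1-1}\mathcal{B}_{\text{simple}}$.

```python
import numpy as np

# Sketch (diagonal-Hessian case): up to a constant factor, eta* is
# 1 / (a / beta + b * beta) with a = (1-b1)/(1+b1), b = 2*b1/(1+b1),
# and beta = (1 + B_simple / B) ** -0.5.  For beta1 > 1/3 the peak over B
# should sit near B = (1 - beta1) / (3*beta1 - 1) * B_simple; otherwise
# eta* keeps growing with B and there is no Surge point.
def eta_star_shape(B, beta1, B_simple=1.0):
    beta = (1 + B_simple / B) ** -0.5
    a = (1 - beta1) / (1 + beta1)
    b = 2 * beta1 / (1 + beta1)
    return 1.0 / (a / beta + b * beta)

B_grid = np.logspace(-2, 3, 2001)          # batch sizes, in units of B_simple
for beta1 in (0.2, 0.5, 0.9):
    curve = eta_star_shape(B_grid, beta1)
    B_peak = B_grid[np.argmax(curve)]      # hits the grid edge if no interior peak
    pred = (1 - beta1) / (3 * beta1 - 1) if beta1 > 1 / 3 else float("inf")
    print(f"beta1={beta1}: peak near B ~ {B_peak:.3g}, predicted {pred:.3g}")
```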
Summary#
This article offers a preliminary analysis of how the EMA mechanism in optimizers affects the scaling laws of learning rate and Batch Size. It confirms that introducing EMA, and the momentum mechanism in particular, does alter these scaling laws, and that optimizers such as Adam, with their double EMA, exhibit new characteristics beyond SignSGD.
@online{kexuefm-11301,
title={Rethinking Learning Rate and Batch Size (IV) - EMA},
author={苏剑林},
year={2025},
month={09},
url={\url{https://kexue.fm/archives/11301}},
}