
Rethinking the Relationship between Learning Rate and Batch Size (II) - Mean Field


This is a gemini-2.5-flash translation of a Chinese article.


At the end of the previous article, "Rethinking the Relationship between Learning Rate and Batch Size (I): Current State", we mentioned that for cases where $\tilde{\boldsymbol{\varphi}}_B$ depends non-linearly on $\tilde{\boldsymbol{g}}_B$, such as SignSGD and SoftSignSGD, the computational burden is quite heavy and the approach is difficult to generalize. For this reason, I invested some effort in trying to simplify the derivations, and fortunately there were some gains. The key idea is the subject of this article: the mean field.

The mean field is a common approximation method in physics. It does not have a fixed form, but the general idea is to move the expectation operation inside the function. In fact, in "Why is Adam’s Update RMS 0.2?", we already glimpsed the charm of the mean field. In this article, we will further witness its remarkable effectiveness in calculating the learning rate regularities of SignSGD/SoftSignSGD.

General Idea of the Method

Following the notation from the previous article, for SignSGD, we have $\tilde{\boldsymbol{\varphi}}_B=\sign(\tilde{\boldsymbol{g}}_B)$. We first need to compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, from which we can then calculate

$$ \eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} $$

where $\boldsymbol{g}$ is the gradient and $\boldsymbol{H}$ is the Hessian matrix. By assumption, the random variable $\tilde{\boldsymbol{g}}_B$ has mean $\boldsymbol{g}$ and covariance matrix $\boldsymbol{\Sigma}/B$. Our main concern is the relationship between $\eta^*$ and the Batch Size $B$. Since $\sign$ is an element-wise operation, we can start by considering a single scalar. The mean field method here originated from an approximate relationship that I one day suddenly realized might hold:

$$ \mathbb{E}[\sign(\tilde{g}_B)] = \mathbb{E}\bigg[\frac{\tilde{g}_B}{\sqrt{\tilde{g}_B^2}}\bigg]\approx \frac{\mathbb{E}[\tilde{g}_B]}{\sqrt{\mathbb{E}[\tilde{g}_B^2]}} = \frac{g}{\sqrt{g^2 + \sigma^2/B}} $$

Readers who have read "How Should the Learning Rate Change as Batch Size Increases?" might be surprised to find that this result, derived in just one line, differs from the result obtained in that article through a series of assumptions and approximations only by an insignificant constant $\pi/2$! This fact made me realize that the mean field approximation might be entirely sufficient for analyzing the relationship between learning rate and Batch Size.
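Before going further, here is a minimal numerical sanity check (my own, not from the original article), assuming the mini-batch gradient is Gaussian with mean $g$ and variance $\sigma^2/B$:

```python
import numpy as np

# Minimal sanity check (my own illustration): compare a Monte Carlo estimate of
# E[sign(g_B)] with the mean-field formula g / sqrt(g^2 + sigma^2 / B),
# assuming the mini-batch gradient is Gaussian with mean g and variance sigma^2/B.
rng = np.random.default_rng(0)
g, sigma = 0.3, 1.0

for B in [1, 4, 16, 64, 256]:
    samples = rng.normal(g, sigma / np.sqrt(B), size=1_000_000)
    mc = np.sign(samples).mean()           # Monte Carlo estimate of E[sign]
    mf = g / np.sqrt(g**2 + sigma**2 / B)  # mean-field approximation
    print(f"B={B:4d}  MC={mc:+.4f}  mean-field={mf:+.4f}")
```

The two columns do not coincide exactly (for a Gaussian the exact answer is an $\text{erf}$), but they stay close to each other over the whole range of $B$, which is all the subsequent analysis needs.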

Derivations based on the mean field have numerous advantages. Firstly, there are fewer assumptions: the original derivation involves at least three (component independence, a normal distribution, and approximating $\text{erf}(x)$ by $x/\sqrt{x^2+c}$), whereas the mean field approximation drops the assumption on the distribution's form and only requires that the approximation itself be applicable. Secondly, the calculation is simple: we completed it in one line above, whereas the original derivation was much more involved even with all those assumptions.

Calculation Process

In this section, we will use the mean field approximation to provide the complete calculation process for SignSGD. First, for the mean $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$, the calculation in the previous section was already almost complete; here we just need to supplement a few details. We use component notation:

$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i = \mathbb{E}[\sign((\tilde{g}_B)_i)] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i}{\sqrt{(\tilde{g}_B)_i^2}}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i]}{\sqrt{\mathbb{E}[(\tilde{g}_B)_i^2]}} = \frac{g_i}{\sqrt{g_i^2 + \sigma_i^2/B}} = \frac{\sign(g_i)}{\sqrt{1 + (\sigma_i^2/g_i^2)/B}} $$

where $\sigma_i^2 = \boldsymbol{\Sigma}_{i,i}$. Since our ultimate concern is the relationship between $\eta^*$ and $B$, and both are scalars, we apply the mean field approximation once more here to separate the denominator part related to $B$ as a scalar:

$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i \approx \frac{\sign(g_i)}{\sqrt{1 + (\sigma_i^2/g_i^2)/B}} \approx \frac{\sign(g_i)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}} \triangleq \mu_i $$

Here, $\mathcal{B}_{\text{simple}} = \tr(\boldsymbol{\Sigma})/\boldsymbol{g}^{\top}\boldsymbol{g}$ is the quantity from the previous article, which also equals $\mathbb{E}[\sigma_i^2]/\mathbb{E}[g_i^2]$ (this $\mathbb{E}$ denotes an average over the index $i$). That is, we replace $\sigma_i^2/g_i^2$, which depends on the index $i$, with an index-independent average $\mathbb{E}[\sigma_i^2]/\mathbb{E}[g_i^2]$. This approximation simplifies the result while retaining the functional form in $B$.

Next is the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$. Here, we reintroduce the assumption of component independence to simplify the result. It is possible to calculate without this assumption, but the result would be more complex and would require other assumptions to simplify the calculation, so it is better to directly introduce the independence assumption. Under the independence assumption, $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}$ is calculated in two parts: for $i\neq j$ and for $i=j$. When $i\neq j$,

$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j} = \mathbb{E}[(\tilde{\varphi}_B)_i(\tilde{\varphi}_B)_j] = \mathbb{E}[(\tilde{\varphi}_B)_i]\mathbb{E}[(\tilde{\varphi}_B)_j] \approx \mu_i \mu_j $$

When $i=j$, it is even simpler, because the square of $\sign$ is always 1, so its expectation is naturally 1. Therefore, the total result can be simply written as $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}\approx \mu_i\mu_j + \delta_{i,j}(1 - \mu_i\mu_j)$.
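As a concreteness check (my own sketch, not part of the original derivation), the two approximations can be assembled and compared against Monte Carlo for a low-dimensional toy problem with independent Gaussian components; here I use the per-component form $\mu_i = \sign(g_i)/\sqrt{1+(\sigma_i^2/g_i^2)/B}$ from before the $\mathcal{B}_{\text{simple}}$ replacement:

```python
import numpy as np

# Sketch (my own illustration): assemble the mean-field approximations
#   E[phi]_i           ~ mu_i = sign(g_i) / sqrt(1 + (sigma_i^2/g_i^2)/B)
#   E[phi phi^T]_{i,j} ~ mu_i*mu_j + delta_{ij} * (1 - mu_i*mu_j)
# for SignSGD and compare them with Monte Carlo, assuming independent
# Gaussian components for the mini-batch gradient.
rng = np.random.default_rng(1)
g     = np.array([0.5, -0.2, 0.1])
sigma = np.array([1.0,  0.5, 2.0])
B = 16

mu = np.sign(g) / np.sqrt(1.0 + (sigma**2 / g**2) / B)
second_mf = np.outer(mu, mu) + np.diag(1.0 - mu**2)   # diagonal entries are exactly 1

g_tilde = rng.normal(g, sigma / np.sqrt(B), size=(1_000_000, len(g)))
phi = np.sign(g_tilde)
second_mc = phi.T @ phi / len(phi)

print("E[phi]   MC:", phi.mean(axis=0))
print("E[phi]   MF:", mu)
print("E[phi phi^T] max |MC - MF|:", np.abs(second_mc - second_mf).max())
```

The agreement is approximate rather than exact, consistent with the bounded-error analysis later in this post, but it is good enough to carry the dependence on $B$.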

Anomalous Phenomena

Substituting the above calculation results into the first equation, we obtain

$$ \eta^* \approx \frac{\sum_i |g_i|}{\frac{1}{\beta}\sum_i H_{i,i} + \beta\sum_{i\neq j} H_{i,j}\sign(g_i g_j)} $$

where $\beta = (1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$. Note that $\beta$ is monotonically increasing with respect to $B$, and $\beta\in(0,1)$, so $\beta$ can be considered a standardized Batch Size. However, the expression for $\eta^*$ is not always monotonic with respect to $\beta$, which can lead to the anomalous behavior where “as Batch Size increases, the learning rate should decrease instead.” The original paper refers to this as the “Surge phenomenon”.

Let’s understand this step by step. When $B\ll \mathcal{B}_{\text{simple}}$, we have $\beta\approx \sqrt{B/\mathcal{B}_{\text{simple}}}$, and at this point $\beta \ll 1$. Then the $1/\beta$ term in the denominator of the above equation will dominate, leading to

$$ \eta^* \approx \frac{\sum_i |g_i|}{\sum_i H_{i,i}}\beta \approx \frac{\sum_i |g_i|}{\sum_i H_{i,i}}\sqrt{B/\mathcal{B}_{\text{simple}}}\propto \sqrt{B} $$

This indicates that SignSGD’s learning rate follows square-root scaling for small Batch Sizes. Since we assume the positive definiteness of the Hessian matrix in our analysis, it must be that $\sum_i H_{i,i} > 0$. Therefore, when $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) \leq 0$, the above equation is always monotonically increasing with respect to $\beta$, and thus $\eta^*$ is also monotonically increasing with respect to $B$. In this case, no anomalous behavior exists.

When $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) > 0$, according to the AM-GM inequality, we can conclude that the denominator of the above equation has a minimum point at

$$ \beta^* = \sqrt{\frac{\sum_i H_{i,i}}{\sum_{i\neq j} H_{i,j}\sign(g_i g_j)}} $$

Note that $\beta\in(0, 1)$, so the minimum is relevant only under the additional condition $\beta^*\in(0, 1)$, i.e., $\sum_{i\neq j} H_{i,j}\sign(g_i g_j) > \sum_i H_{i,i}$. In this case, $\eta^*$ is no longer monotonically increasing with respect to $B$, but first increases and then decreases: there exists a critical Batch Size beyond which the learning rate should actually decrease. This is the “Surge phenomenon”.
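To make the Surge phenomenon concrete, here is a small toy construction of my own: a positive definite $\boldsymbol{H}$ whose sign-weighted off-diagonal sum exceeds its trace, so that $\beta^*<1$ and $\eta^*(B)$ first rises and then falls.

```python
import numpy as np

# Toy illustration (my own construction, not from the article): the Surge
# phenomenon for SignSGD. We evaluate
#   eta* ~ sum_i |g_i| / ( (1/beta) sum_i H_ii + beta * sum_{i!=j} H_ij sign(g_i g_j) )
# with beta = (1 + B_simple/B)^(-1/2), using a positive definite H whose
# sign-weighted off-diagonal sum exceeds its trace, so that beta* < 1.
g = np.array([1.0, 0.8, 0.5, 1.2])
n, rho, B_simple = len(g), 0.8, 100.0
H = (1 - rho) * np.eye(n) + rho * np.ones((n, n))      # positive definite

diag_sum = np.trace(H)
off_sum = (H * np.sign(np.outer(g, g))).sum() - diag_sum

beta_star = np.sqrt(diag_sum / off_sum)                # minimizer of the denominator
B_crit = B_simple * beta_star**2 / (1 - beta_star**2)  # critical Batch Size
print(f"beta* = {beta_star:.3f}, critical B ~ {B_crit:.1f}")

for B in [10, 30, 70, 200, 1000, 100000]:
    beta = 1.0 / np.sqrt(1.0 + B_simple / B)
    eta = np.abs(g).sum() / (diag_sum / beta + beta * off_sum)
    print(f"B={B:6d}  eta* = {eta:.4f}")               # rises, then falls past B_crit
```

With these numbers, $\beta^*\approx 0.65$, corresponding to a critical Batch Size of roughly $71$: the printed $\eta^*$ values increase up to that point and decrease afterwards.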

Reflection on the Cause

Why does the Surge phenomenon, this anomalous behavior, occur? In fact, this reflects an incomplete compatibility between the optimizer’s inherent assumptions and our analysis method. Specifically, to estimate the optimal learning rate, we expanded the change in loss to a second-order approximation and assumed the positive definiteness of the Hessian matrix.

From the perspective of Newton’s method, different optimizers essentially make different assumptions about the Hessian matrix. For instance, SGD corresponds to the assumption $\boldsymbol{H}=\eta_{\max}^{-1} \boldsymbol{I}$, while SignSGD corresponds to the assumption $\boldsymbol{H}=\eta_{\max}^{-1} \diag(|\boldsymbol{g}|)$. Of course, in actual training, we can only replace $\boldsymbol{g}$ with $\tilde{\boldsymbol{g}}_B$. The Surge phenomenon actually indicates that as $B\to\infty$, the divergence between the Hessian matrix assumed by SignSGD and the actual Hessian matrix increases.
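As a quick sanity check of this correspondence: a Newton step with $\boldsymbol{H}=\eta_{\max}^{-1} \diag(|\boldsymbol{g}|)$ is

$$ -\boldsymbol{H}^{-1}\boldsymbol{g} = -\eta_{\max}\,\diag(|\boldsymbol{g}|)^{-1}\boldsymbol{g} = -\eta_{\max}\sign(\boldsymbol{g}) $$

which is exactly a SignSGD update with learning rate $\eta_{\max}$ (ignoring components where $g_i=0$).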

We know that current LLMs have billions of parameters, so computing either the full Hessian matrix or the full covariance matrix is practically impossible. This is also one of the reasons why we introduced the independence assumption when calculating the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$: in that case the covariance matrix reduces to a diagonal matrix, making estimation feasible. The situation is similar for the Hessian matrix; in practice we can often only compute Hessian matrices with specific structures.

For example, substituting $\boldsymbol{H}=\eta_{\max}^{-1} \diag(|\boldsymbol{g}|)$ into the second equation yields $\eta^*\approx \eta_{\max} \beta = \eta_{\max} / \sqrt{1 + \mathcal{B}_{\text{simple}}/B}$. This form is very concise and exhibits no anomalous behavior. Does this mean the Surge phenomenon would not appear? No, the Surge phenomenon is objectively present. The point intended here is rather that, when we observe the Surge phenomenon in experiments, perhaps the primary task is not to modify the scaling rule for $\eta^*$, but to switch optimizers.

Loss Change

With $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we can also calculate $\overline{\Delta\mathcal{L}}$ as in the previous article. What is particularly interesting is that it takes the same form as the SGD result:

$$ \overline{\Delta\mathcal{L}} = \mathcal{L}(\boldsymbol{w}) - \mathbb{E}[\mathcal{L}(\boldsymbol{w} - \eta^*\tilde{\boldsymbol{g}}_B)] \approx \frac{(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g})^2}{2\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})}\approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B} $$

where

$$ \Delta\mathcal{L}_{\max} = \frac{\frac{1}{2}(\sum_i |g_i|)^2}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\sign(g_i g_j)},\quad \mathcal{B}_{\text{noise}} = \frac{\mathcal{B}_{\text{simple}}\sum_i H_{i,i}}{\sum_i H_{i,i} + \sum_{i\neq j} H_{i,j}\sign(g_i g_j)} $$
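Continuing the toy construction from the Surge section (again my own illustration, not the article's), one can check numerically that $\overline{\Delta\mathcal{L}}(B)$ increases monotonically even though $\eta^*(B)$ does not:

```python
import numpy as np

# Continuation of the toy Surge example (my own construction): even though
# eta*(B) rises and then falls, the average loss decrease
#   dL(B) = dL_max / (1 + B_noise/B)
# is monotonically increasing in B.
g = np.array([1.0, 0.8, 0.5, 1.2])
n, rho, B_simple = len(g), 0.8, 100.0
H = (1 - rho) * np.eye(n) + rho * np.ones((n, n))

diag_sum = np.trace(H)
off_sum = (H * np.sign(np.outer(g, g))).sum() - diag_sum

dL_max = 0.5 * np.abs(g).sum()**2 / (diag_sum + off_sum)
B_noise = B_simple * diag_sum / (diag_sum + off_sum)

for B in [10, 30, 70, 200, 1000, 100000]:
    print(f"B={B:6d}  dL = {dL_max / (1 + B_noise / B):.4f}")   # monotone in B
```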

Note that these expressions retain the full Hessian matrix, so the result is actually quite interesting: although the learning rate $\eta^*$ may exhibit the Surge phenomenon, the average reduction in the loss does not. It is always monotonically increasing with respect to $B$ and keeps the same form as for SGD, which means we can derive the same “training data size - training steps” relationship:

$$ \left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1 $$
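As a reminder of how this follows (the argument is the same as in the previous article): if reaching a target loss requires a fixed total decrease, then the number of steps scales as $S \propto 1/\overline{\Delta\mathcal{L}} \propto 1 + \mathcal{B}_{\text{noise}}/B$ and the amount of training data as $E = BS \propto B + \mathcal{B}_{\text{noise}}$. Taking $S_{\min}$ as the $B\to\infty$ limit and $E_{\min}$ as the $B\to 0$ limit gives

$$ \frac{S}{S_{\min}} - 1 = \frac{\mathcal{B}_{\text{noise}}}{B},\qquad \frac{E}{E_{\min}} - 1 = \frac{B}{\mathcal{B}_{\text{noise}}} $$

whose product is exactly $1$.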

A more thought-provoking question is: why do the updates of SGD and SignSGD differ significantly, including distinct behaviors for the learning rate $\eta^*$, yet the relationship of $\overline{\Delta\mathcal{L}}$ with respect to $B$ takes the same form? Is this merely a coincidence, or is there a deeper principle supporting it?

General Regularities

Still starting from the mean field approximation, I obtained an answer that leans towards the latter. Whether for $\eta^*$ or $\overline{\Delta\mathcal{L}}$, the core difficulty lies in calculating $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, so our goal is to explore a unified calculation regularity for both.

Generally, let $\tilde{\boldsymbol{\varphi}}_B=\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B$, where $\tilde{\boldsymbol{H}}_B$ is some positive definite (hence invertible) matrix. Then we can write

$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}[\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B]\approx \underbrace{\mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1}}_{\text{denoted as }\hat{\boldsymbol{H}}{}^{-1}}\mathbb{E}[\tilde{\boldsymbol{g}}_B] = \hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g} $$

and

$$ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}] = \mathbb{E}[\tilde{\boldsymbol{H}}{}_B^{-1}\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}\tilde{\boldsymbol{H}}{}_B^{-1}]\approx \mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1}\mathbb{E}[\tilde{\boldsymbol{g}}_B\tilde{\boldsymbol{g}}_B^{\top}]\mathbb{E}[\tilde{\boldsymbol{H}}_B]^{-1} = \hat{\boldsymbol{H}}{}^{-1}(\boldsymbol{g}\boldsymbol{g}^{\top} + \boldsymbol{\Sigma}/B)\hat{\boldsymbol{H}}{}^{-1} $$

Substituting these into the expression for $\overline{\Delta\mathcal{L}}$, we obtain

$$ \overline{\Delta\mathcal{L}} \approx \frac{1}{2}\frac{(\boldsymbol{g}^{\top}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g})^2}{\boldsymbol{g}^{\top}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{H}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{g} + \tr(\boldsymbol{\Sigma}\hat{\boldsymbol{H}}{}^{-1}\boldsymbol{H}\hat{\boldsymbol{H}}{}^{-1})/B} $$

Note that the above equation is homogeneous with respect to $\hat{\boldsymbol{H}}$. If we assume that the relationship between $\hat{\boldsymbol{H}}$ and $B$ can be separated into a scalar form, i.e., $\hat{\boldsymbol{H}}\approx f(B) \boldsymbol{G}$, where $f(B)$ is a scalar function of $B$ and $\boldsymbol{G}$ is not explicitly related to $B$, then $f(B)$ can be canceled from both the numerator and the denominator. The final relationship with respect to $B$ can then be organized into the following form:

$$ \overline{\Delta\mathcal{L}} \approx \frac{\Delta\mathcal{L}_{\max}}{1 + \mathcal{B}_{\text{noise}}/B} $$
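Explicitly, with $\hat{\boldsymbol{H}}\approx f(B)\boldsymbol{G}$ the factor $f(B)^{-2}$ appears in both the numerator and the denominator and cancels, leaving

$$ \Delta\mathcal{L}_{\max} = \frac{(\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{g})^2}{2\,\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{H}\boldsymbol{G}^{-1}\boldsymbol{g}},\qquad \mathcal{B}_{\text{noise}} = \frac{\tr(\boldsymbol{\Sigma}\boldsymbol{G}^{-1}\boldsymbol{H}\boldsymbol{G}^{-1})}{\boldsymbol{g}^{\top}\boldsymbol{G}^{-1}\boldsymbol{H}\boldsymbol{G}^{-1}\boldsymbol{g}} $$

both of which are independent of $B$, as required.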

This proves that $\overline{\Delta\mathcal{L}}$ follows the same asymptotic law with respect to $B$, the core reason being the homogeneity with respect to $\hat{\boldsymbol{H}}$. In contrast, $\eta^*$ has no such unified result, because it is not homogeneous with respect to $\hat{\boldsymbol{H}}$.

Effective Analysis

By now, everyone should have some feel for the mean field method. Its main characteristic is computational simplicity, or more fundamentally, the mean field chooses to compute in whatever direction is simple and tractable, which gives it great flexibility. However, flexibility is often also a drawback: it means it is hard to pin down a rule for what the next step should be.

As for explaining why this approach is effective, that is even harder. It can only be analyzed case by case, and sometimes even a specific problem is hard to analyze thoroughly. My feeling is that the mean field method is three parts computation, three parts luck, three parts intuition, plus one part metaphysics. Of course, there is no harm in trying. Let's take the SignSGD calculation as an example and attempt an analysis.

Evidently, the most critical calculation for SignSGD is $\mathbb{E}[\sign(x)]$. Let $\mathbb{E}[x]=\mu, \mathbb{E}[x^2]=\mu^2 + \sigma^2$. Then we can write

$$ \sign(x) = \frac{x}{\sqrt{x^2}} = \frac{x}{\sqrt{\mu^2 + \sigma^2 + (x^2 - \mu^2 - \sigma^2)}} $$

Assuming $x^2 - \mu^2 - \sigma^2$ is small, we perform a Taylor expansion:

$$ \sign(x) = \frac{x}{\sqrt{\mu^2 + \sigma^2}} - \frac{1}{2}\frac{x(x^2 - \mu^2 - \sigma^2)}{(\mu^2 + \sigma^2)^{3/2}} + \frac{3}{8}\frac{x(x^2 - \mu^2 - \sigma^2)^2}{(\mu^2 + \sigma^2)^{5/2}}-\cdots $$

Now, the denominators are independent of $x$, and the numerators are polynomials of $x$. Taking the expectation on both sides, the first term is precisely the result of the mean field approximation, $\mu/\sqrt{\mu^2 + \sigma^2}$. To examine the rationality of the mean field approximation, let’s calculate the second term:

$$ \frac{1}{2}\frac{\mathbb{E}[x(x^2 - \mu^2 - \sigma^2)]}{(\mu^2 + \sigma^2)^{3/2}} = \frac{1}{2}\frac{\mathbb{E}[x^3] - (\mu^3 + \mu\sigma^2)}{(\mu^2 + \sigma^2)^{3/2}} $$

This involves $\mathbb{E}[x^3]$, which is a new statistic and a key factor in the mean field error. We can get a sense of this by considering a normal distribution $\mathcal{N}(x;\mu,\sigma^2)$, for which $\mathbb{E}[x^3]=\mu^3 + 3\mu\sigma^2$. Substituting this into the above expression:

$$ \frac{\mu\sigma^2}{(\mu^2 + \sigma^2)^{3/2}} = \frac{\sigma^2/\mu^2}{(1 + \sigma^2/\mu^2)^{3/2}} $$

The right-hand side is a bounded expression, with its maximum value attained at $\sigma^2/\mu^2=2$, yielding a result of $2/3^{3/2}=0.3849\cdots$. This indicates that the error of the mean field approximation is very likely finite, and the error term approaches 0 as $\sigma\to 0$ and $\sigma\to\infty$. These observations, to some extent, demonstrate the applicability of the mean field approximation.
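To put numbers on this (my own check, for the Gaussian case, where the exact value is $\mathbb{E}[\sign(x)]=\text{erf}\big(\mu/(\sigma\sqrt{2})\big)$):

```python
import math

# Numerical check (my own, for Gaussian x ~ N(mu, sigma^2)): exact value
# E[sign(x)] = erf(mu / (sigma*sqrt(2))) versus the mean-field value
# mu / sqrt(mu^2 + sigma^2), alongside the second-order term bounded above.
mu = 1.0
for ratio in [0.25, 0.5, 1.0, 2.0, 4.0, 16.0]:      # ratio = sigma^2 / mu^2
    sigma = mu * math.sqrt(ratio)
    exact = math.erf(mu / (sigma * math.sqrt(2)))
    mf = mu / math.sqrt(mu**2 + sigma**2)
    term2 = mu * sigma**2 / (mu**2 + sigma**2)**1.5  # the bounded correction term
    print(f"sigma^2/mu^2={ratio:5.2f}  exact={exact:.4f}  "
          f"mean-field={mf:.4f}  2nd-order term={term2:.4f}")
```

On this grid the actual gap between the exact value and the mean-field value stays below about $0.07$, well under the $0.3849$ bound on the second-order term alone, since higher-order terms partially cancel.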

Generalized Approximation

One reason for choosing to analyze SignSGD is that we typically use it as a theoretical approximation for Adam. In "How does Adam’s Epsilon Affect the Learning Rate Scaling Law?", we calculated a theoretically better approximation, SoftSignSGD, which considers the effect of $\epsilon$.

$$ \sign(x)=\frac{x}{\sqrt{x^2}}\quad\to\quad\softsign(x)=\frac{x}{\sqrt{x^2+\epsilon^2}} $$

In this case, $\tilde{\boldsymbol{\varphi}}_B = \softsign(\tilde{\boldsymbol{g}}_B)$. Let’s get straight to the point:

$$ \begin{aligned} &\,\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]_i = \mathbb{E}[\softsign((\tilde{g}_B)_i)] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i}{\sqrt{(\tilde{g}_B)_i^2 + \epsilon^2}}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i]}{\sqrt{\mathbb{E}[(\tilde{g}_B)_i^2]+ \epsilon^2}} \\[8pt] =&\, \frac{g_i}{\sqrt{g_i^2 + \sigma_i^2/B + \epsilon^2}} = \frac{\softsign(g_i)}{\sqrt{1 + \sigma_i^2/(g_i^2 + \epsilon^2)/B}}\approx \frac{\softsign(g_i)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}}\triangleq \nu_i\beta \end{aligned} $$

Here, $\mathcal{B}_{\text{simple}}$ is slightly different; it is $\tr(\boldsymbol{\Sigma})/(\boldsymbol{g}^{\top}\boldsymbol{g} + N\epsilon^2)$, where $N$ is the total number of model parameters, i.e., $\boldsymbol{g}\in\mathbb{R}^N$. As for the final terms, $\nu_i=\softsign(g_i)$ and $\beta = (1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$. Next, we calculate $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$. Under the independence assumption, when $i\neq j$, we can still calculate the means separately, so $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}=\nu_i \nu_j \beta^2$. Therefore, we only need to calculate the case where $i=j$:

$$ \begin{aligned} &\,\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,i} = \mathbb{E}[\softsign((\tilde{g}_B)_i)^2] = \mathbb{E}\bigg[\frac{(\tilde{g}_B)_i^2}{(\tilde{g}_B)_i^2 + \epsilon^2}\bigg]\approx \frac{\mathbb{E}[(\tilde{g}_B)_i^2]}{\mathbb{E}[(\tilde{g}_B)_i^2]+ \epsilon^2} \\[8pt] =&\, \frac{g_i^2 + \sigma_i^2/B}{g_i^2 + \sigma_i^2/B + \epsilon^2} = 1 - \frac{1 - \softsign(g_i)^2}{1 + \sigma_i^2/(g_i^2 + \epsilon^2)/B}\approx 1 - \frac{1 - \softsign(g_i)^2}{1 + \mathcal{B}_{\text{simple}}/B} \end{aligned} $$

This can be uniformly written as $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]_{i,j}\approx \nu_i \nu_j\beta^2 + \delta_{i,j}(1-\beta^2)$, and thus

$$ \eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} \approx \frac{\beta\sum_i \nu_i g_i}{\sum_i H_{i,i} + \beta^2(\sum_{i,j} \nu_i \nu_j H_{i,j} - \sum_i H_{i,i})} $$

Except for $\beta$, the rest of the above expression is independent of $B$. Therefore, we have obtained an explicit relationship for $\eta^*$ with respect to $B$, which is largely similar in form to that of SignSGD. The remaining analysis can refer to "How does Adam’s Epsilon Affect the Learning Rate Scaling Law?" or follow the previous content.
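As with SignSGD, a quick Monte Carlo comparison (my own, assuming a Gaussian component) gives a feel for how tight the SoftSign mean-field moments are:

```python
import numpy as np

# Monte Carlo comparison (my own, for a single Gaussian component) of the
# SoftSign mean-field results:
#   E[softsign(x)]   ~ g / sqrt(g^2 + sigma^2/B + eps^2)
#   E[softsign(x)^2] ~ (g^2 + sigma^2/B) / (g^2 + sigma^2/B + eps^2)
rng = np.random.default_rng(2)
g, sigma, eps, B = 0.3, 1.0, 0.1, 64

x = rng.normal(g, sigma / np.sqrt(B), size=1_000_000)
s = x / np.sqrt(x**2 + eps**2)

v = g**2 + sigma**2 / B                      # mean-field estimate of E[x^2]
print("first moment : MC = %.4f  MF = %.4f" % (s.mean(), g / np.sqrt(v + eps**2)))
print("second moment: MC = %.4f  MF = %.4f" % ((s**2).mean(), v / (v + eps**2)))
```

The two agree only approximately, as expected when an expectation is moved inside a nonlinear function, but the discrepancy is of the same bounded kind analyzed in the previous section.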

Summary

In this article, we re-derived the conclusions for SignSGD and SoftSignSGD using the mean field approximation, greatly simplifying the relevant calculation process, and initially pondered the general regularities of these calculations.

@online{kexuefm-11280,
        title={Rethinking the Relationship between Learning Rate and Batch Size (II) - Mean Field},
        author={苏剑林},
        year={2025},
        month={09},
        url={\url{https://kexue.fm/archives/11280}},
}