
Asymptotic Estimation of AdamW's Weight RMS (Part 2)


This is a gemini-2.5-flash translation of a Chinese article.

It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.

By Su Jianlin | 2025-11-17

In the blog post “Asymptotic Estimation of AdamW’s Weight RMS (Part 1)”, we derived the asymptotic expression for the RMS of model weights trained with AdamW. However, at that time, we assumed that both Weight Decay and the learning rate were fixed throughout the training process, which doesn’t completely align with actual training. Therefore, in this article, we will generalize our previous conclusions to a dynamic version.

The dynamic version allows both Weight Decay and the learning rate to change as the number of training steps increases, such as with classic Cosine Decay or WSD (Warmup Stable Decay), making the conclusions more general.

Step One

Our starting point is still the definition of AdamW:

$$ \text{Adam} {\color{skyblue}{\text{W}}} :=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{u}_t =\hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t {\color{skyblue}{ + \lambda_t \boldsymbol{\theta}_{t-1}}}) \end{aligned}\right. $$
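
As a concrete reference, the update rule above translates directly into NumPy. The following is a minimal sketch of a single step (the function name adamw_step and the default hyperparameters are illustrative, not from the original article):

import numpy as np

def adamw_step(theta, m, v, g, t, lr, wd, beta1=0.9, beta2=0.95, eps=1e-8):
    # exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # bias-corrected estimates (t counts from 1)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)
    # decoupled weight decay: the lr * wd * theta term is what AdamW adds on top of Adam
    theta = theta - lr * (u + wd * theta)
    return theta, m, v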

Since $\eta_t\lambda_t\ll 1$, we can write:

$$ \boldsymbol{\theta}_t = (1 - \eta_t\lambda_t)\boldsymbol{\theta}_{t-1} -\eta_t\boldsymbol{u}_t \approx e^{- \eta_t\lambda_t}\boldsymbol{\theta}_{t-1} -\eta_t\boldsymbol{u}_t $$

Let $\kappa_t = \sum_{i=1}^t \eta_i\lambda_i$. Then, direct expansion yields:

$$ \boldsymbol{\theta}_t \approx e^{-\kappa_t}\boldsymbol{\theta}_0 - \sum_{i=1}^t e^{-(\kappa_t - \kappa_i)}\eta_i\boldsymbol{u}_i = e^{-\kappa_t}\left(\boldsymbol{\theta}_0 - \sum_{i=1}^t e^{\kappa_i}\eta_i\boldsymbol{u}_i\right) $$
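
To see where this comes from, it may help to unroll the recursion explicitly for, say, $t=2$:

$$ \boldsymbol{\theta}_2 \approx e^{-\eta_2\lambda_2}\left(e^{-\eta_1\lambda_1}\boldsymbol{\theta}_0 - \eta_1\boldsymbol{u}_1\right) - \eta_2\boldsymbol{u}_2 = e^{-\kappa_2}\boldsymbol{\theta}_0 - e^{-(\kappa_2 - \kappa_1)}\eta_1\boldsymbol{u}_1 - e^{-(\kappa_2 - \kappa_2)}\eta_2\boldsymbol{u}_2 $$

Each increment $\eta_i\boldsymbol{u}_i$ is attenuated by the decay accumulated after step $i$, namely $e^{-(\kappa_t - \kappa_i)}$, and factoring out $e^{-\kappa_t}$ gives the bracketed form.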

Now, let $z_t = \sum_{i=1}^t e^{\kappa_i}\eta_i$. By the mean-field approximation, we get:

$$ \bar{\boldsymbol{u}}_t\triangleq\frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i \boldsymbol{u}_i = \frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i \frac{\boldsymbol{m}_i}{\sqrt{\boldsymbol{v}_i}}\approx \frac{\bar{\boldsymbol{m}}_t \,\,\triangleq\,\, \frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\boldsymbol{m}_i}{\sqrt{\bar{\boldsymbol{v}}_t \,\,\triangleq\,\, \frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\boldsymbol{v}_i}} $$

Since the random initialization $\boldsymbol{\theta}_0$ is approximately uncorrelated with $\bar{\boldsymbol{u}}_t$, the cross term can be dropped, leading to:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0 - z_t \bar{\boldsymbol{u}}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{-2\kappa_t}z_t^2\Vert\bar{\boldsymbol{u}}_t\Vert_{RMS}^2 $$

Step Two

Following the previous approach, to estimate $\Vert \bar{\boldsymbol{u}}_t\Vert_{RMS}^2$, we need to assume that the gradients $\boldsymbol{g}_j$ are independent and identically distributed as $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\sigma}^2)$, and then compute:

$$ \mathbb{E}[\bar{\boldsymbol{u}}_t^2] \approx \mathbb{E}\left[\frac{\bar{\boldsymbol{m}}_t^2}{\bar{\boldsymbol{v}}_t}\right] \approx \frac{\mathbb{E}[\bar{\boldsymbol{m}}_t^2]}{\mathbb{E}[\bar{\boldsymbol{v}}_t]} $$

Finally, averaging the components of $\mathbb{E}[\bar{\boldsymbol{u}}_t^2]$ gives an approximation to $\Vert \bar{\boldsymbol{u}}_t\Vert_{RMS}^2$.

Expanding $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$ gives:

$$ \boldsymbol{m}_t = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\boldsymbol{g}_i,\qquad \boldsymbol{v}_t = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\boldsymbol{g}_i^2 $$

We also have the identity:

$$ \sum_{i=1}^t \sum_{j=1}^i a_i b_j = \sum_{j=1}^t \sum_{i=j}^t a_i b_j $$
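
This is simply summing $a_i b_j$ over the triangular region $j \le i$ in two different orders; a quick numerical check, purely for illustration:

import numpy as np

t = 50
a, b = np.random.randn(t), np.random.randn(t)
lhs = sum(a[i] * b[j] for i in range(t) for j in range(i + 1))      # sum_i sum_{j<=i}
rhs = sum(a[i] * b[j] for j in range(t) for i in range(j, t))       # sum_j sum_{i>=j}
print(np.isclose(lhs, rhs))   # True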

Using these two results, we can write:

$$ \begin{gather} \bar{\boldsymbol{m}}_t = \frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\boldsymbol{m}_i = \frac{1 - \beta_1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\sum_{j=1}^i \beta_1^{i-j}\boldsymbol{g}_j = \sum_{j=1}^t\boldsymbol{g}_j\underbrace{\frac{1 - \beta_1}{z_t}\sum_{i=j}^t e^{\kappa_i}\beta_1^{i-j}\eta_i}_{\text{denoted as }\bar{\beta}_1(j,t)} \\ \bar{\boldsymbol{v}}_t = \frac{1}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\boldsymbol{v}_i = \frac{1 - \beta_2}{z_t}\sum_{i=1}^t e^{\kappa_i}\eta_i\sum_{j=1}^i \beta_2^{i-j}\boldsymbol{g}_j^2 = \sum_{j=1}^t\boldsymbol{g}_j^2\underbrace{\frac{1 - \beta_2}{z_t}\sum_{i=j}^t e^{\kappa_i}\beta_2^{i-j}\eta_i}_{\text{denoted as }\bar{\beta}_2(j,t)} \end{gather} $$

Step Three

First, let's handle the denominator. When $t$ is sufficiently large (so that $\beta_1^t$ and $\beta_2^t$ are sufficiently small), $\sum_{j=1}^t \bar{\beta}_1(j,t)$ and $\sum_{j=1}^t \bar{\beta}_2(j,t)$ are both close to 1, because they are essentially doubly weighted averages with the order of summation swapped. Therefore, we have:

$$ \mathbb{E}[\bar{\boldsymbol{v}}_t] = \sum_{j=1}^t\bar{\beta}_2(j,t) \mathbb{E}[\boldsymbol{g}_j^2] = \sum_{j=1}^t\bar{\beta}_2(j,t) (\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2 $$
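
As a quick numerical sanity check of the claim that these weights sum to approximately 1 (the constant schedule below is an arbitrary illustration):

import numpy as np

T, beta2, lam = 2000, 0.95, 0.1
eta = 1e-3 * np.ones(T)                       # any slowly varying schedule will do
kappa = np.cumsum(lam * eta)                  # kappa_i
w = np.exp(kappa) * eta                       # e^{kappa_i} * eta_i
z = w.sum()                                   # z_t
idx = np.arange(T)
# bar_beta_2(j, t) = (1 - beta2) / z_t * sum_{i >= j} e^{kappa_i} beta2^{i-j} eta_i
bar_beta2 = np.array([(1 - beta2) / z * np.sum(w[j:] * beta2**(idx[j:] - j)) for j in range(T)])
print(bar_beta2.sum())                        # ≈ 1 for sufficiently large T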

Similarly, $\mathbb{E}[\bar{\boldsymbol{m}}_t] \approx \boldsymbol{\mu}$. As for $\mathbb{E}[\bar{\boldsymbol{m}}_t^2] = \mathbb{E}[\bar{\boldsymbol{m}}_t]^2 + \mathbb{V}ar[\bar{\boldsymbol{m}}_t]$, using the additivity of variance for the independent $\boldsymbol{g}_j$, we get:

$$ \mathbb{V}ar[\bar{\boldsymbol{m}}_t] = \sum_{j=1}^t\bar{\beta}_1(j,t)^2 \mathbb{V}ar[\boldsymbol{g}_j] = \sum_{j=1}^t\bar{\beta}_1(j,t)^2 \boldsymbol{\sigma}^2 $$

Thus,

$$ \mathbb{E}[\bar{\boldsymbol{u}}_t^2] \approx \frac{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2\sum_{j=1}^t\bar{\beta}_1(j,t)^2}{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2} $$

and

$$ \Vert\bar{\boldsymbol{u}}_t\Vert_{RMS}^2 \approx \frac{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + \sum_{j=1}^t\bar{\beta}_1(j,t)^2}{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + 1} $$

Finally, we have:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{-2\kappa_t}z_t^2\frac{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + \sum_{j=1}^t\bar{\beta}_1(j,t)^2}{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + 1} $$
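
For reference, this estimate can be evaluated numerically for any given schedule. Below is a small sketch (the function name weight_rms_estimate and its interface are illustrative; r denotes the ratio $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$, with $r=0$ corresponding to pure-noise gradients):

import numpy as np

def weight_rms_estimate(eta, lam, beta1=0.9, init_rms=0.1, r=0.0):
    # eta, lam: arrays of per-step learning rates and weight decays
    T = len(eta)
    kappa = np.cumsum(eta * lam)                       # kappa_i
    w = np.exp(kappa) * eta                            # e^{kappa_i} * eta_i
    z = w.sum()                                        # z_t
    idx = np.arange(T)
    # bar_beta_1(j, t) = (1 - beta1) / z_t * sum_{i >= j} e^{kappa_i} beta1^{i-j} eta_i
    bb1 = np.array([(1 - beta1) / z * np.sum(w[j:] * beta1**(idx[j:] - j)) for j in range(T)])
    u_rms2 = (r + np.sum(bb1**2)) / (r + 1)            # squared RMS of bar_u_t
    rms2 = np.exp(-2 * kappa[-1]) * (init_rms**2 + z**2 * u_rms2)
    return rms2**0.5

# e.g. weight_rms_estimate(1e-3 * np.ones(10000), 0.1 * np.ones(10000))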

Readers coming to this article directly may find some of these steps a bit abrupt; if so, it is advisable to revisit “Asymptotic Estimation of AdamW’s Weight RMS (Part 1)” to get familiar with the reasoning behind each approximation.

Example One

First, consider $\boldsymbol{\mu}=\boldsymbol{0}$. Substituting the expression for $\bar{\beta}_1(j,t)$ into the above equation gives:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{-2\kappa_t}(1-\beta_1)^2\sum_{j=1}^t\left(\sum_{i=j}^t e^{\kappa_i}\beta_1^{i-j}\eta_i\right)^2 $$

Next, consider a simple case where $\lambda_t=0$, i.e., no Weight Decay. In this situation:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx \Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + (1-\beta_1)^2\sum_{j=1}^t\left(\sum_{i=j}^t \beta_1^{i-j}\eta_i\right)^2 $$

If $\beta_1\to 0$, then we immediately have $\Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx \Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + \sum_{j=1}^t\eta_j^2$. This indicates that without Weight Decay, for the Weight RMS not to blow up as the number of training steps $t\to\infty$, the sum of squares of the learning-rate sequence must converge. This is also one of the classic conditions in traditional optimization theory. In fact, even for $0 < \beta_1 < 1$, this condition remains necessary and sufficient, i.e.,

$$ \sum_{j=1}^{\infty}\left(\sum_{i=j}^{\infty} \beta_1^{i-j}\eta_i\right)^2 < \infty \qquad\Leftrightarrow\qquad \sum_{j=1}^{\infty}\eta_j^2 < \infty $$

The proof is not difficult. Let’s transform the left side:

$$ \begin{aligned} \sum_{j=1}^{\infty}\left(\sum_{i=j}^{\infty} \beta_1^{i-j}\eta_i\right)^2 =& \sum_{j=1}^{\infty}\left(\sum_{i_1=0}^{\infty} \beta_1^{i_1}\eta_{i_1+j}\right)\left(\sum_{i_2=0}^{\infty} \beta_1^{i_2}\eta_{i_2+j}\right) \\ =&\, \sum_{i_1=0}^{\infty}\sum_{i_2=0}^{\infty} \beta_1^{i_1 + i_2}\sum_{j=1}^{\infty}\eta_{i_1+j}\eta_{i_2+j} \end{aligned} $$

This shows that if the left side converges, then $\sum_{j=1}^{\infty}\eta_{i_1+j}\eta_{i_2+j}$ converges for every $i_1, i_2$; in particular, the $i_1 = i_2 = 0$ term shows that $\sum_{j=1}^{\infty}\eta_j^2$ converges, proving necessity. For sufficiency, we can start from the above equation and use the Cauchy-Schwarz inequality:

$$ \begin{aligned} \sum_{i_1=0}^{\infty}\sum_{i_2=0}^{\infty} \beta_1^{i_1 + i_2}\sum_{j=1}^{\infty}\eta_{i_1+j}\eta_{i_2+j} \leq&\, \sum_{i_1=0}^{\infty}\sum_{i_2=0}^{\infty} \beta_1^{i_1 + i_2}\sqrt{\left(\sum_{j=1}^{\infty}\eta_{i_1+j}^2\right)\left(\sum_{j=1}^{\infty}\eta_{i_2+j}^2\right)} \\ \leq&\, \sum_{i_1=0}^{\infty}\sum_{i_2=0}^{\infty} \beta_1^{i_1 + i_2}\sqrt{\left(\sum_{j=1}^{\infty}\eta_j^2\right)\left(\sum_{j=1}^{\infty}\eta_j^2\right)} \\ =&\, \frac{1}{(1-\beta_1)^2} \sum_{j=1}^{\infty}\eta_j^2 \end{aligned} $$

Therefore, the convergence of $\sum_{j=1}^{\infty}\eta_j^2$ implies the convergence of the left side, proving sufficiency.
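
A small numerical illustration of these bounds, taking $\eta_j = 1/j$ (for which $\sum_j \eta_j^2$ converges) and truncating the sums at an arbitrary length:

import numpy as np

T, beta1 = 5000, 0.9
eta = 1.0 / np.arange(1, T + 1)          # eta_j = 1/j, so sum_j eta_j^2 converges
idx = np.arange(T)
lhs = sum(np.sum(beta1**(idx[j:] - j) * eta[j:])**2 for j in range(T))
rhs = np.sum(eta**2)
print(rhs, lhs, rhs / (1 - beta1)**2)    # lhs lies between rhs and rhs / (1 - beta1)^2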

Example Two

Next, let's consider the case where Weight Decay is constant but the learning rate is variable. In this case, $\kappa_t = \lambda\sum_{i=1}^t \eta_i$. If we want to train indefinitely and obtain a solution as close as possible to the theoretical optimum, the learning rate should satisfy $\sum_{i=1}^{\infty} \eta_i = \infty$. This is necessary for the first term $e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2$ to completely “forget” the initialization (a theoretically optimal solution should be independent of the initialization). Interestingly, this is also one of the classic conditions in traditional optimization theory.

For general schedules, evaluating the $\boldsymbol{\mu}=\boldsymbol{0}$ expression above exactly is quite difficult, but we can make further approximations based on practical scenarios. In practical training, typically $\lambda_t \eta_t \ll 1$, so $e^{\kappa_i}$ grows much more slowly than $\beta_1^i$ decays; at the same time, the learning rate $\eta_i$ usually varies slowly on the timescale over which $\beta_1^{i-j}$ decays. Therefore, we can consider the approximation:

$$ \sum_{i=j}^t e^{\kappa_i}\beta_1^{i-j}\eta_i \approx \sum_{i=j}^t e^{\kappa_j}\beta_1^{i-j}\eta_j = e^{\kappa_j}\eta_j\sum_{i=j}^t\beta_1^{i-j}\approx e^{\kappa_j}\eta_j\sum_{i=j}^{\infty}\beta_1^{i-j} = \frac{e^{\kappa_j}\eta_j}{1-\beta_1} $$

Substituting this approximation back into the equation for $\boldsymbol{\mu}=\boldsymbol{0}$ gives:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{- 2\kappa_t}\sum_{j=1}^t e^{2\kappa_j}\eta_j^2 $$
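
The quality of this approximation is easy to probe numerically by comparing the original double sum from the $\boldsymbol{\mu}=\boldsymbol{0}$ expression with the simplified series (the schedule and hyperparameters below are arbitrary illustrations):

import numpy as np

T, beta1, lam = 2000, 0.9, 0.1
idx = np.arange(T)
eta = 1e-4 + 9e-4 * (1 - np.cos(np.pi * idx / T)) / 2   # an arbitrary smooth schedule
kappa = np.cumsum(lam * eta)
w = np.exp(kappa) * eta

# original double sum: (1 - beta1)^2 * sum_j ( sum_{i >= j} e^{kappa_i} beta1^{i-j} eta_i )^2
full = (1 - beta1)**2 * sum(np.sum(w[j:] * beta1**(idx[j:] - j))**2 for j in range(T))

# simplified series: sum_j e^{2 kappa_j} eta_j^2
simple = np.sum(np.exp(2 * kappa) * eta**2)
print(full, simple)   # the two should agree closely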

From here on, we can only proceed case by case for specific choices of $\eta_j$. For instance, when both $\lambda_j$ and $\eta_j$ are constant, we have $\kappa_t = \lambda\eta t$, and:

$$ \begin{aligned} \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx&\, e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{- 2\kappa_t}\sum_{j=1}^t e^{2\kappa_j}\eta_j^2 \\ =&\, e^{-2\lambda\eta t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{-2\lambda\eta t}\sum_{j=1}^t e^{2\lambda\eta j}\eta^2 \\ =&\, e^{-2\lambda\eta t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + \frac{e^{2\lambda\eta}(1 - e^{-2\lambda\eta t})}{e^{2\lambda\eta} - 1}\eta^2 \\ \approx&\, e^{-2\lambda\eta t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + (1 - e^{-2\lambda\eta t} )\frac{\eta}{2\lambda} \end{aligned} $$

This is consistent with the results from the previous article.

Differential Equation

For numerical computation, the simplified expression above is already quite concise. However, obtaining analytical results for general $\lambda_t, \eta_t$ is usually still difficult, so we need to look for further computational tools.

Considering that integrals are generally easier to compute than sums, we can try to approximate the sum with an integral:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{- 2\kappa_t}\sum_{j=1}^t e^{2\kappa_j}\eta_j^2\approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + e^{- 2\kappa_t}\int_0^t e^{2\kappa_s}\eta_s^2 ds $$

where $\kappa_t = \int_0^t \lambda_s\eta_s ds$. Let $\rho_t = \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2$. Multiplying both sides by $e^{2\kappa_t}$ and then differentiating yields $\frac{d}{dt}(e^{2\kappa_t}\rho_t) \approx e^{2\kappa_t}\eta_t^2$. Rearranging, we get:

$$ \frac{d}{dt}\rho_t \approx -2\lambda_t\eta_t\rho_t + \eta_t^2 $$

This is the differential equation satisfied by the squared RMS, and it is not particularly complicated. If $\rho_t$ converges to a constant as $t\to\infty$, then the left side tends to 0, so:

$$ \lim_{t\to\infty} \rho_t \approx \lim_{t\to\infty} \frac{\eta_t}{2\lambda_t} $$

This tells us that for decay-type learning rate schedules, the final learning rate should not be set to 0; otherwise, there is a risk of model weights collapsing under long-term training. Of course, we can also choose to set $\lambda_t\propto \eta_t$ to avoid weight collapse.
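
As a sanity check on this differential equation, we can integrate it with a simple forward-Euler scheme (step size 1, matching one optimizer step) and compare against the series form above; the schedule here is an arbitrary illustration:

import numpy as np

T, lam, init_std = 10000, 0.1, 0.1
idx = np.arange(T)
eta = 1e-4 + 9e-4 * (1 - np.cos(np.pi * idx / T)) / 2   # an arbitrary smooth schedule

# forward-Euler integration of d(rho)/dt = -2 * lambda_t * eta_t * rho + eta_t^2
rho = init_std**2
for t in range(T):
    rho += -2 * lam * eta[t] * rho + eta[t]**2

# series form: e^{-2 kappa_t} * (rho_0 + sum_j e^{2 kappa_j} eta_j^2)
kappa = np.cumsum(lam * eta)
rho_series = np.exp(-2 * kappa[-1]) * (init_std**2 + np.sum(np.exp(2 * kappa) * eta**2))
print(rho**0.5, rho_series**0.5)   # the two RMS values should nearly coincide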

Mean Field

The $t\to\infty$ regime typically corresponds to multi-epoch supervised training. Pre-training is usually single-epoch, and in that case $\kappa_t$ is often $\mathcal{O}(1)$, because an excessively large $\kappa_t$ would “forget” the early training samples, just as the weight $e^{-2\kappa_t}$ suppresses the $\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2$ term.

Under the assumption that $\kappa_t=\mathcal{O}(1)$, we can consider a mean-field approximation. Starting again from the integral form, by definition, in $[0,t]$, $\kappa_s$ is a monotonically increasing function starting from $0$ and ending at $\kappa_t$. Thus, $e^{\kappa_s} \geq 1$ and $e^{\kappa_s - \kappa_t} \leq 1$. Then,

$$ e^{- 2\kappa_t}\int_0^t \eta_s^2 ds \leq e^{- 2\kappa_t}\int_0^t e^{2\kappa_s}\eta_s^2 ds = \int_0^t e^{2\kappa_s- 2\kappa_t}\eta_s^2 ds \leq \int_0^t \eta_s^2 ds $$

That is, the target integral is bounded between $e^{- 2\kappa_t} \nu_t$ and $\nu_t$, where $\nu_t = \int_0^t \eta_s^2 ds$. When $\kappa_t=\mathcal{O}(1)$, $e^{- 2\kappa_t}$ will not be much smaller than 1, which means $\nu_t$ itself is already a good approximation. Of course, we can be more precise and estimate a reasonable multiplier for $\nu_t$:

$$ e^{- 2\kappa_t}\int_0^t e^{2\kappa_s}\eta_s^2 ds \approx e^{- 2\kappa_t}\int_0^t e^{2\kappa_s} (\nu_t / t) ds = \frac{\nu_t e^{- 2\kappa_t}}{t}\int_0^t e^{2\kappa_s} ds $$

Considering that $\kappa_s$ is a monotonically increasing function from $0$ to $\kappa_t$, we approximate it with $(\kappa_t/t)s$:

$$ e^{- 2\kappa_t}\int_0^t e^{2\kappa_s}\eta_s^2 ds \approx \frac{\nu_t e^{- 2\kappa_t}}{t}\int_0^t e^{2\kappa_s} ds \approx \frac{\nu_t e^{- 2\kappa_t}}{t}\int_0^t e^{2(\kappa_t/t)s} ds = \frac{\nu_t}{2\kappa_t}(1 - e^{- 2\kappa_t}) $$

Substituting into the integral form gives:

$$ \Vert\boldsymbol{\theta}_t\Vert_{RMS}^2 \approx e^{-2\kappa_t}\Vert\boldsymbol{\theta}_0\Vert_{RMS}^2 + (1 - e^{- 2\kappa_t})\frac{\nu_t}{2\kappa_t} $$

Example Three

Returning to the common setting we are primarily concerned with, “fixed Weight Decay, variable learning rate,” let's calculate $\kappa_t, \nu_t$ for a few specific examples. First, for a linear learning-rate schedule:

$$ \eta_s = \eta_a + (\eta_b - \eta_a) s / t $$

Here, $\eta_a$ and $\eta_b$ are the initial and final learning rates, respectively. It could be $\eta_b > \eta_a$ (e.g., Warmup) or $\eta_b < \eta_a$ (e.g., linear Decay), and $t$ is the total expected number of training steps. Integrating gives:

$$ \begin{gather} \kappa_t = \int_0^t \lambda\eta_s ds= \lambda (\eta_a + \eta_b) t / 2 \\ \nu_t = \int_0^t \eta_s^2 ds = (\eta_a^2 + \eta_a \eta_b + \eta_b^2) t / 3 \end{gather} $$

Next, for Cosine Decay:

$$ \eta_s = \eta_{\max} + (\eta_{\min} - \eta_{\max})\left(\frac{1}{2} + \frac{1}{2}\cos \frac{s\pi}{t}\right) $$

Integrating gives:

$$ \begin{gather} \kappa_t = \int_0^t \lambda\eta_s ds= \lambda (\eta_{\min} + \eta_{\max}) t / 2 \\ \nu_t = \int_0^t \eta_s^2 ds = (3\eta_{\min}^2 + 2\eta_{\min} \eta_{\max} + 3\eta_{\max}^2 ) t / 8 \end{gather} $$

Finally, for WSD (Warmup Stable Decay):

$$ \eta_s = \left\{\begin{aligned} \frac{s}{t_1}\eta_{\max}, \quad s \in [0, t_1] \\[5pt] \eta_{\max} , \quad s \in [t_1, t_2] \\[5pt] \frac{t-s}{t-t_2}\eta_{\max}, \quad s \in [t_2, t] \end{aligned}\right. $$

Integrating gives:

$$ \begin{gather} \kappa_t = \int_0^t \lambda\eta_s ds= \lambda \eta_{\max} (t + t_2 - t_1) / 2 \\ \nu_t = \int_0^t \eta_s^2 ds = \eta_{\max}^2 (t + 2t_2 - 2t_1) / 3 \end{gather} $$
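
These closed forms are easy to double-check against the defining integrals with a crude numerical quadrature (the hyperparameter values below are arbitrary):

import numpy as np

t, t1, t2 = 10000.0, 1000.0, 8000.0
eta_max, eta_min, lam = 1e-3, 1e-4, 0.1
s = np.linspace(0, t, 100001)
ds = s[1] - s[0]

def check(eta, kappa_closed, nu_closed):
    # compare numerical integrals of lam * eta and eta^2 with the closed forms above
    print(np.sum(lam * eta) * ds, kappa_closed, np.sum(eta**2) * ds, nu_closed)

# linear schedule from eta_a to eta_b
eta_a, eta_b = eta_min, eta_max
check(eta_a + (eta_b - eta_a) * s / t,
      lam * (eta_a + eta_b) * t / 2,
      (eta_a**2 + eta_a * eta_b + eta_b**2) * t / 3)

# cosine schedule, as written above
check(eta_max + (eta_min - eta_max) * (0.5 + 0.5 * np.cos(np.pi * s / t)),
      lam * (eta_min + eta_max) * t / 2,
      (3 * eta_min**2 + 2 * eta_min * eta_max + 3 * eta_max**2) * t / 8)

# WSD: linear warmup on [0, t1], constant on [t1, t2], linear decay on [t2, t]
check(np.where(s < t1, s / t1 * eta_max,
               np.where(s < t2, eta_max, (t - s) / (t - t2) * eta_max)),
      lam * eta_max * (t + t2 - t1) / 2,
      eta_max**2 * (t + 2 * t2 - 2 * t1) / 3)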

Simulation Verification

We can also verify the above approximations through numerical simulations:

import numpy as np

N, T = 10000, 10000
beta1, beta2 = 0.9, 0.95
m, v = 0, 0
w = np.random.randn(N) * (init_std := 0.1)
lr_max, lr_min, wd = 0.001, 0.0001, 0.1
lr = lr_max + (lr_min - lr_max) * (1 + np.cos(np.arange(T) / T * np.pi)) / 2
for i in range(T):
    g = np.random.randn(N)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    w = w - lr[i] * (m / v**0.5 + wd * w)

# Direct computation ≈ 0.0774
weight_rms = (w**2).mean()**0.5

# Series approximation ≈ 0.0776
kappa = wd * lr.cumsum()
approx1 = ((np.exp(kappa * 2) * lr**2).sum() + init_std**2)**0.5 * np.exp(-kappa[-1])

# Mean-field approximation ≈ 0.0760
kappa = wd * (lr_max + lr_min) / 2 * T
nu = (3 * lr_min**2 + 2 * lr_min * lr_max + 3 * lr_max**2) / 8 * T  # nu_t for the cosine schedule (Example Three)
approx2 = ((np.exp(kappa * 2) - 1) * nu / kappa / 2 + init_std**2)**0.5 * np.exp(-kappa)

print(weight_rms)
print(approx1)
print(approx2)

Summary

This article extends the results from the previous part to a dynamic version, allowing us to estimate AdamW’s Weight RMS under time-varying learning rates and Weight Decay.

@online{kexuefm-11404,
        title={Asymptotic Estimation of AdamW's Weight RMS (Part 2)},
        author={苏剑林},
        year={2025},
        month={11},
        url={\url{https://kexue.fm/archives/11404}},
}