These days I’ve been revisiting a Meta paper from last year, “A Theory on Adam Instability in Large-Scale Machine Learning”. It offers a new perspective on adaptive learning rate optimizers like Adam: it suggests that the moving average of squared gradients can be seen, to some extent, as an estimate of the square of the Hessian matrix, so optimizers like Adam and RMSprop are in fact approximations of the second-order Newton’s method.
This perspective is quite novel, and it appears to differ significantly from some previous Hessian approximations, making it worth studying and pondering.
Newton’s Descent#
Let the loss function be $\mathcal{L}(\boldsymbol{\theta})$, where the parameters to be optimized are $\boldsymbol{\theta}$. Our optimization objective is
$$ \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\quad(1) $$Assuming the current value of $\boldsymbol{\theta}$ is $\boldsymbol{\theta}_t$, Newton’s method seeks $\boldsymbol{\theta}_{t+1}$ by expanding the loss function to the second order:
$$ \mathcal{L}(\boldsymbol{\theta})\approx \mathcal{L}(\boldsymbol{\theta}_t) + \boldsymbol{g}_t^{\top}(\boldsymbol{\theta} - \boldsymbol{\theta}_t) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_t)^{\top}\boldsymbol{\mathcal{H}}_t(\boldsymbol{\theta} - \boldsymbol{\theta}_t) $$where $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t)$ is the gradient and $\boldsymbol{\mathcal{H}}_t=\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L}(\boldsymbol{\theta}_t)$ is the Hessian matrix. Assuming the positive definiteness of the Hessian matrix, the right side of the above equation has a unique minimum $\boldsymbol{\theta}_t - \boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t$. Newton’s method sets this as the next $\boldsymbol{\theta}_{t+1}$:
$$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t-\boldsymbol{\mathcal{H}}_t^{-1}\boldsymbol{g}_t = \boldsymbol{\theta}_t - (\nabla_{\boldsymbol{\theta}_t}^2\mathcal{L})^{-1} \nabla_{\boldsymbol{\theta}_t}\mathcal{L} $$Note that the above equation has no additional learning rate parameter, thus Newton’s method is inherently an adaptive learning rate algorithm. Of course, since the complexity of the Hessian matrix is proportional to the square of the number of parameters, a full Newton’s method in deep learning is mostly of theoretical value. To actually apply Newton’s method, significant simplifying assumptions must be made about the Hessian matrix, such as assuming it’s a diagonal or low-rank matrix.
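As a quick illustration (my own toy example, not from the article): on a quadratic loss the Hessian is constant, so a single Newton step lands exactly on the minimum, with no learning rate in sight.

```python
import numpy as np

# Toy sketch of the Newton update theta <- theta - H^{-1} g on a quadratic
# loss L(theta) = 0.5 * theta^T A theta - b^T theta, whose Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # positive definite Hessian
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b            # gradient of the quadratic loss

theta = np.zeros(2)
theta = theta - np.linalg.solve(A, grad(theta))  # one Newton step

# for a quadratic loss, one step reaches the exact minimum: the gradient vanishes
print(np.allclose(grad(theta), 0.0))
```

In practice one never inverts the Hessian explicitly; even in this sketch, `np.linalg.solve` solves the linear system instead of forming $\boldsymbol{\mathcal{H}}_t^{-1}$.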
From the perspective of Newton’s method, SGD assumes $\boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\boldsymbol{I}$, while Adam assumes $\boldsymbol{\mathcal{H}}_t=\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon)$, where
$$ \text{Adam}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t\odot\boldsymbol{g}_t\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right. \end{aligned}\right. $$Next, we want to prove that $\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})$ is a better approximation of $\boldsymbol{\mathcal{H}}_t$.
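Transcribed into code, one update looks like the following sketch (a direct NumPy rendering of the equations above; the toy loss and hyperparameter values are made up for illustration):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, transcribing the equations above."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g           # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, theta, m, v, t, lr=0.01)
print(np.linalg.norm(theta) < 0.1)
```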
Gradient Approximation#
The key to the proof is to use the first-order approximation of the gradient:
$$ \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{g}_{\boldsymbol{\theta}^*} + \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) $$where $\boldsymbol{g}_{\boldsymbol{\theta}^*}$ and $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}$ indicate that we expand at $\boldsymbol{\theta}=\boldsymbol{\theta}^*$. Here, $\boldsymbol{\theta}^*$ is the target we seek in (1), where the model’s gradient is zero. Thus, the above equation can be simplified to
$$ \boldsymbol{g}_{\boldsymbol{\theta}} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) $$Therefore,
$$ \boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} $$Assuming that after training enters the “right track,” the model will remain “oscillating” around $\boldsymbol{\theta}^*$, converging slowly and spirally, then to some extent we can treat $\boldsymbol{\theta} - \boldsymbol{\theta}^*$ as a random variable from a normal distribution $\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})$, so
$$ \mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}^{\top}] \approx \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\mathbb{E}[(\boldsymbol{\theta} - \boldsymbol{\theta}^*)(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^{\top}]\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} = \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top}\quad(2) $$Assuming the Hessian matrix is diagonal, then we can retain only the diagonal elements in the above equation:
$$ \text{diag}(\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]) \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^2\quad\Rightarrow\quad \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*} = \frac{1}{\sigma}\text{diag}(\sqrt{\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]}) $$Isn’t this quite similar? Adam’s $\hat{\boldsymbol{v}}_t$ is a moving average of squared gradients, which can be seen as approximating $\mathbb{E}[\boldsymbol{g}_{\boldsymbol{\theta}}\odot\boldsymbol{g}_{\boldsymbol{\theta}}]$. Finally, if we further assume that $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}$ does not change much compared to $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}$, we can conclude that $\eta_t^{-1}\text{diag}(\sqrt{\hat{\boldsymbol{v}}_t})$ is an approximation of $\boldsymbol{\mathcal{H}}_t$.
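The relation $\text{diag}(\mathbb{E}[\boldsymbol{g}\odot\boldsymbol{g}]) \approx \sigma^2\boldsymbol{\mathcal{H}}^2$ is easy to check numerically by simulating the oscillation assumption directly (a toy sketch; the diagonal of $\boldsymbol{\mathcal{H}}$ and $\sigma$ are assumed values of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic loss with diagonal Hessian H: g(theta) = H (theta - theta_star).
# Simulate the assumption theta - theta_star ~ N(0, sigma^2 I) and average g*g.
h = np.array([2.0, 5.0])            # diagonal of H
sigma = 0.1

deltas = sigma * rng.standard_normal((100_000, 2))   # samples of theta - theta_star
grads = deltas * h                                    # elementwise, since H is diagonal
Eg2 = (grads ** 2).mean(axis=0)                       # approx sigma^2 * h^2

h_est = np.sqrt(Eg2) / sigma                          # recover diag(H) from E[g . g]
print(h_est)                                          # close to [2, 5]
```

Up to the unknown scale $\sigma$, the square root of the averaged squared gradients recovers the diagonal of the Hessian, which is exactly the role $\sqrt{\hat{\boldsymbol{v}}_t}$ plays in Adam’s denominator.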
This can also explain why Adam’s $\beta_2$ is usually greater than $\beta_1$. To estimate the Hessian more accurately, the moving average of $\hat{\boldsymbol{v}}_t$ should be as “long-term” as possible (closer to a uniform average), so $\beta_2$ should be very close to 1. On the other hand, the momentum $\hat{\boldsymbol{m}}_t$ is a moving average of the gradients. If the average of the gradients is too long-term, the result will approach $\boldsymbol{g}_{\boldsymbol{\theta}^*}=\boldsymbol{0}$, which is not ideal. Therefore, the moving average for momentum should be more local.
Related Work#
For readers familiar with Hessian matrix theory, the first reaction upon seeing the above conclusion might be confusion rather than familiarity. This is because a classical approximation of the Hessian matrix is the outer product of the Jacobian matrix (which is closely related to the gradient), whereas the Hessian approximation here is the square root of the outer product of the gradient; the two differ by a square root.
Specifically, let’s take the squared error loss as an example:
$$ \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{2}\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\Vert \boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\Vert^2]\quad(3) $$Expanding at $\boldsymbol{\theta}_t$, we have $\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})\approx \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x}) + \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top} (\boldsymbol{\theta} - \boldsymbol{\theta}_t)$, where $\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}=\nabla_{\boldsymbol{\theta}_t} \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x})$ is the Jacobian matrix. Substituting this into the above equation yields:
$$ \mathcal{L}(\boldsymbol{\theta}) \approx \frac{1}{2}\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\Vert \boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}_t}(\boldsymbol{x}) - \boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top} (\boldsymbol{\theta} - \boldsymbol{\theta}_t)\Vert^2] $$The simplified equation above is merely a quadratic form with respect to $\boldsymbol{\theta}$, so its Hessian matrix can be directly written as:
$$ \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t} \approx \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}_t}^{\top}] $$This is the Hessian approximation based on the outer product of the Jacobian matrix, and it forms the theoretical basis of the “Gauss–Newton algorithm”. Of course, $\boldsymbol{\mathcal{J}}$ is not $\boldsymbol{g}$, and we need to try to relate the result to $\boldsymbol{g}$. Differentiating equation (3) directly gives:
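For a linear model the Jacobian does not depend on $\boldsymbol{\theta}$, so $\mathbb{E}[\boldsymbol{\mathcal{J}}\boldsymbol{\mathcal{J}}^{\top}]$ is the exact Hessian of the squared error, which makes the Gauss–Newton formula easy to sanity-check (a sketch with made-up data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear least squares: f_theta(x) = theta . x, so J_theta = x (independent
# of theta) and the Gauss-Newton matrix E[J J^T] = E[x x^T] is the exact Hessian.
X = rng.standard_normal((5000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.standard_normal(5000)

gauss_newton = X.T @ X / len(X)              # E[J J^T]

def grad(theta):                             # gradient of 0.5 * mean (y - X theta)^2
    return -(X.T @ (y - X @ theta)) / len(X)

# finite-difference Hessian, column by column: (grad(eps*e_i) - grad(0)) / eps
eps = 1e-5
H_fd = np.column_stack([(grad(eps * np.eye(3)[i]) - grad(np.zeros(3))) / eps
                        for i in range(3)])
print(np.allclose(H_fd, gauss_newton, atol=1e-4))
```

For nonlinear models the Gauss–Newton matrix is only an approximation: it drops the term involving second derivatives of $\boldsymbol{f}_{\boldsymbol{\theta}}$ weighted by the residuals.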
$$ \boldsymbol{g}_{\boldsymbol{\theta}} = \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))] $$Thus,
$$ \begin{aligned} \boldsymbol{g}_{\boldsymbol{\theta}} \boldsymbol{g}_{\boldsymbol{\theta}}^{\top} =&\, \big(\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))]\big)\big(\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))]\big)^{\top} \\[5pt] =&\, \big(\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))]\big)\big(\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}[(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))^{\top}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}^{\top}]\big) \\[5pt] \approx&\, \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}\big[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))^{\top}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}^{\top}\big] \\[5pt] \approx&\, \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}\Big[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}\big[(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))(\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))^{\top}\big]\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}^{\top}\Big] \\[5pt] \end{aligned} $$Here, the two approximate equality signs do not have much theoretical basis; they can roughly be seen as mean-field approximations. 
The term $\boldsymbol{y} - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x})$ is the residual of the regression prediction, which we usually assume follows $\mathcal{N}(\boldsymbol{0},\sigma^2\boldsymbol{I})$. Therefore, we have:
$$ \boldsymbol{g}_{\boldsymbol{\theta}} \boldsymbol{g}_{\boldsymbol{\theta}}^{\top} \approx \sigma^2\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathcal{D}}\big[\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}\boldsymbol{\mathcal{J}}_{\boldsymbol{\theta}}^{\top}\big] \approx \sigma^2 \boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}\quad(4) $$This reveals the connection between $\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}_t}$ and $\boldsymbol{g}_{\boldsymbol{\theta}} \boldsymbol{g}_{\boldsymbol{\theta}}^{\top}$. Comparing this with equation (2) from the previous section, it appears to differ by a square.
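The mean-field steps are hard to verify directly for the full-batch gradient, but the per-sample analogue, $\mathbb{E}[\boldsymbol{g}_i\boldsymbol{g}_i^{\top}] = \sigma^2\,\mathbb{E}[\boldsymbol{\mathcal{J}}\boldsymbol{\mathcal{J}}^{\top}]$ for per-sample gradients $\boldsymbol{g}_i = \boldsymbol{\mathcal{J}}_i(\boldsymbol{y}_i - \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_i))$, holds exactly under the residual assumption. A quick simulation (my own toy sketch, with a linear model and an assumed $\sigma$) confirms it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Per-sample gradient at theta*: g_i = J_i * residual_i, with J_i = x_i for a
# linear model. Assume residuals ~ N(0, sigma^2), independent of x (as in the text).
sigma = 0.3
X = rng.standard_normal((200_000, 2))
resid = sigma * rng.standard_normal(200_000)

g = X * resid[:, None]                        # per-sample gradients (sign irrelevant)
second_moment = g.T @ g / len(X)              # E[g g^T]
gauss_newton = X.T @ X / len(X)               # E[J J^T], the Hessian here

print(np.allclose(second_moment, sigma**2 * gauss_newton, atol=0.01))
```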
Looking at the two derivations, neither seems to contain an obvious error, so how should we understand the inconsistency? One way to interpret it: equation (4) is an “instantaneous” approximation of the Hessian at step $t$, while equation (2) is a “long-term average” over time steps. Long-term averaging cancels out part of the fluctuations (while, in principle, also yielding a more accurate estimate), which is why an extra square root appears.
A similar effect appears in the SDEs introduced in “A Casual Talk on Generative Diffusion Models (V): SDEs as a General Framework”: the intensity of the noise term in an SDE needs to be half an order higher than that of the drift term. This is likewise because noise partially cancels out under long-term averaging, so it must be of a higher order for its effect to survive in the final result.
More Connections#
In the previous derivations, we assumed that $\boldsymbol{\theta}^*$ is the theoretical optimum, so $\boldsymbol{g}_{\boldsymbol{\theta}^*} = \boldsymbol{0}$. What if $\boldsymbol{\theta}^*$ is an arbitrary point? Then equation (2) would become:
$$ \mathbb{E}[(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})(\boldsymbol{g}_{\boldsymbol{\theta}}-\boldsymbol{g}_{\boldsymbol{\theta}^*})^{\top}] \approx \sigma^2\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}\boldsymbol{\mathcal{H}}_{\boldsymbol{\theta}^*}^{\top} $$This means that as long as we use a moving average of the covariance instead of the second moment, we can obtain a Hessian approximation within a local range. This exactly corresponds to the approach of the AdaBelief optimizer, where its $\boldsymbol{v}$ is the moving average of the squared difference between $\boldsymbol{g}$ and $\boldsymbol{m}$:
$$ \text{AdaBelief}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) (\boldsymbol{g}_t - \boldsymbol{m}_t)\odot(\boldsymbol{g}_t - \boldsymbol{m}_t)\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right. \end{aligned}\right. $$Summary#
This article introduces a perspective on adaptive learning rate optimizers like Adam from the viewpoint of Newton’s method and Hessian approximation, and discusses related results on Hessian approximation.
@online{kexuefm-10588,
title={Understanding Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation},
author={苏剑林},
year={2024},
month={11},
url={\url{https://kexue.fm/archives/10588}},
}