
Why is the default norm length for gradient clipping 1?


This is a gemini-2.5-flash translation of a Chinese article.


By Su Jianlin | 2025-01-02 | 69,611 Readers

As we know, Gradient Clipping is a common technique used to make model training more stable. Common gradient clipping methods clip gradients based on the total norm of all parameter gradients, and its operation can be expressed as

$$ \text{clip}(\boldsymbol{g},\tau)=\left\{\begin{aligned}&\boldsymbol{g}, & \Vert\boldsymbol{g}\Vert\leq \tau \\ &\frac{\tau}{\Vert\boldsymbol{g}\Vert}\boldsymbol{g},&\Vert\boldsymbol{g}\Vert > \tau \end{aligned}\right. $$

In this way, $\text{clip}(\boldsymbol{g},\tau)$ maintains the same direction as $\boldsymbol{g}$, but its norm does not exceed $\tau$. Note that $\Vert\boldsymbol{g}\Vert$ here is the norm calculated by treating all parameter gradients of the entire model as a single vector, also known as the Global Gradient Norm.
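Concretely, the operation above can be sketched in a few lines of pure Python (the function name `clip_global_norm` and the list-of-lists gradient representation are illustrative choices; frameworks implement the same idea, e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`):

```python
import math

def clip_global_norm(grads, tau=1.0):
    """Clip a collection of per-parameter gradients by their global L2 norm.

    `grads` is a list of flat lists of floats (one per parameter tensor);
    the norm is computed over *all* entries jointly, matching the formula above.
    """
    total = math.sqrt(sum(x * x for g in grads for x in g))
    if total <= tau:
        return grads  # identity: global norm already within the threshold
    scale = tau / total  # shrink so the global norm becomes exactly tau
    return [[x * scale for x in g] for g in grads]

grads = [[3.0, 0.0], [0.0, 4.0]]            # global norm = 5
clipped = clip_global_norm(grads, tau=1.0)  # every entry scaled by 1/5
```

Note that the direction of the concatenated gradient vector is preserved; only its length changes.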

I wonder if anyone has noticed a detail: whether it’s a model with millions or tens of billions of parameters, the value of $\tau$ is often 1. What does this imply? Is it simply reusing the default value, or is there some profound principle hidden behind it?

What Is It?

Some readers might think, “The default value isn’t necessarily optimal, so what’s there to fuss about?” Indeed, $\tau=1$ may not be the optimal choice, but it is the default choice for many models, and its performance is acceptable under this default setting, which in turn suggests that $\tau=1$ has a universal reasonableness.

What does “reasonableness” refer to here? Let’s return to the $\text{clip}$ operation. If $\Vert\boldsymbol{g}\Vert$ is always less than $\tau$, then $\text{clip}$ degenerates into an identity transformation; if $\Vert\boldsymbol{g}\Vert$ is always greater than $\tau$, then $\text{clip}$ degenerates into L2 normalization. In other words, the reason $\text{clip}$ is $\text{clip}$ is that $\tau$ creates an appropriate degree of distinction, such that most $\Vert\boldsymbol{g}\Vert$ values are less than $\tau$, and only a small portion are greater than $\tau$. This is the meaning of $\tau$’s reasonableness.

Of course, there are counterexamples, and quite a few, but the main point here is to emphasize the universality of this phenomenon and the general applicability of this default setting, so meticulous readers need not dwell too much on individual details.

Therefore, what we mean by the universal reasonableness of $\tau=1$ is this: regardless of the model’s parameter count, initialization method, or choice of loss function, its total gradient norm happens to take roughly $1$ as the boundary separating “outlier” values. This is a rather remarkable property; at least, that was the author’s reaction upon first realizing it.

Why?

Why such a “coincidence”? The author’s answer might be somewhat surprising: because only in this way can the model be trained stably.

Let’s consider a loss function $\mathcal{L}(\boldsymbol{\theta})$ and an optimizer with update rule $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \boldsymbol{u}_t$, and write $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta})$ for the gradient. Then, to first order, the change in the loss function is approximately

$$ \Delta \mathcal{L} = \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \mathcal{L}(\boldsymbol{\theta}_t) \approx (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t)\cdot\nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}) = -\eta\, \boldsymbol{u}_t\cdot \boldsymbol{g}_t $$

First, let’s consider the simplest SGD, where $\boldsymbol{u}_t = \boldsymbol{g}_t$ and $\Delta \mathcal{L}=-\eta\Vert\boldsymbol{g}_t\Vert^2$. This means the change in the loss function is proportional to the square of the gradient norm. As we know, whether in CV or NLP, pure SGD (without momentum) is a very inefficient optimizer. In the mid-to-late stages of training, on average, the loss reduction per step for most tasks is far less than the learning rate, i.e., $|\Delta \mathcal{L}| < \eta$, from which it can be deduced that $\Vert\boldsymbol{g}_t\Vert < 1$. This indicates that $\Vert\boldsymbol{g}_t\Vert < 1$ is a long-term characteristic of a model that converges normally.
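The relation $\Delta \mathcal{L} \approx -\eta\Vert\boldsymbol{g}_t\Vert^2$ can be checked numerically. Here is a toy sketch on the quadratic $\mathcal{L}(\boldsymbol{\theta}) = \tfrac{1}{2}\Vert\boldsymbol{\theta}\Vert^2$ (all values are illustrative, not from the article):

```python
# Numerical check of ΔL ≈ -η‖g‖² for plain SGD on a toy quadratic
# L(θ) = ½‖θ‖², so the gradient is ∇L = θ.
eta = 0.01
theta = [0.3, -0.4]                       # ‖g‖ = ‖θ‖ = 0.5

def loss(t):
    return 0.5 * sum(x * x for x in t)

g = theta[:]                              # gradient at θ for this quadratic
new_theta = [t - eta * x for t, x in zip(theta, g)]

delta_L = loss(new_theta) - loss(theta)   # actual change in the loss
predicted = -eta * sum(x * x for x in g)  # first-order estimate -η‖g‖²
# delta_L matches predicted up to an O(η²) correction
```

With $\eta = 0.01$ the first-order estimate is off only by the $O(\eta^2)$ term, as expected.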

Of course, in the early stages of training, it’s normal for the model to exhibit $\Vert\boldsymbol{g}_t\Vert > 1$, but it’s rare to see $\Vert\boldsymbol{g}_t\Vert \gg 1$. In other words, a good initialization should avoid $\Vert\boldsymbol{g}_t\Vert \gg 1$, which is the theoretical basis for methods like DeepNorm. The reason is similar: if the gradient norm is too large, early learning will be too “aggressive,” leading to premature convergence to a poor local optimum. Another solution is to reduce $\eta$, which also reduces $|\Delta \mathcal{L}|$. This is why we typically use Warmup in the early stages of training.

By the way, for an understanding of Warmup, you can refer to the paper “Optimal Linear Decay Learning Rate Schedules and Further Refinements”, which the author believes provides the most reasonable analysis of Warmup.

What to Do?

Simply put, because the change in the loss function is proportional to the square of the gradient norm, training stability dictates that the gradient norm cannot be too large, and in the long run, it should be less than 1. If, in the early stages, a gradient norm significantly greater than 1 appears, the usual strategy is Warmup. Alternatively, a more general strategy can be considered: setting another threshold $\mathcal{T}$ and clipping $\eta$ based on the value of $\boldsymbol{u}_t\cdot \boldsymbol{g}_t$

$$ \eta_t = \left\{\begin{aligned}&\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t\leq \mathcal{T} \\ &\frac{\mathcal{T}}{\boldsymbol{u}_t\cdot \boldsymbol{g}_t}\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t > \mathcal{T} \end{aligned}\right. $$

This eliminates the need for additional Warmup settings and provides greater adaptivity.
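A minimal sketch of this learning-rate clipping rule (the function name `clipped_lr` and the example threshold are illustrative assumptions):

```python
def clipped_lr(eta, dot_ug, T=1.0):
    """Scale the step size when the first-order loss decrease η·(u·g)
    would exceed η·T, per the formula above."""
    if dot_ug <= T:
        return eta                 # normal step: no damping
    return eta * T / dot_ug        # damp so that η_t·(u·g) = η·T

# Early in training, a large u·g gets damped automatically,
# mimicking the effect of a warmup schedule:
lr_late = clipped_lr(0.1, 0.5)     # unchanged: 0.1
lr_early = clipped_lr(0.1, 4.0)    # damped to 0.1 * 1/4 = 0.025
```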

For optimizers like Adam, similar to “How Should the Learning Rate Change When Batch Size Increases?”, we can perform an approximate analysis by setting $\boldsymbol{u}_t=\text{sign}(\boldsymbol{g}_t)$. In this case,

$$ \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \Vert\boldsymbol{g}_t\Vert_1 $$

Here, $\Vert\cdot\Vert_1$ is the L1 norm, i.e., the sum of the absolute values of the components. Since a model has many parameters, each with a small gradient component, $\Vert\boldsymbol{g}_t\Vert_1$ is typically far larger than $\Vert\boldsymbol{g}_t\Vert$ (for $N$ components of comparable magnitude the ratio is on the order of $\sqrt{N}$). Therefore, also for the sake of stable training, Adam’s learning rate is usually significantly smaller than SGD’s. Furthermore, the above equation can be rewritten as

$$ \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \sqrt{N}\Vert\boldsymbol{g}_t\Vert \cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t) $$

This assumes that $\boldsymbol{g}_t$ has no zero components, so $\Vert\text{sign}(\boldsymbol{g}_t)\Vert=\sqrt{N}$, where $N$ is the total number of model parameters. In practice, $\Vert\boldsymbol{g}_t\Vert$ and $\cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t)$ turn out to be roughly constant across model scales, so keeping $\Delta \mathcal{L}$ constant requires $\eta$ to be inversely proportional to $\sqrt{N}$: if the parameter count quadruples, the learning rate should roughly be halved.
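A quick numeric sketch of the sign-update identity and the resulting $\sqrt{N}$ learning-rate rule (toy Gaussian gradient; all values here are assumed for illustration):

```python
import math
import random

random.seed(0)

# Toy "gradient": N Gaussian components of small magnitude
N = 10_000
g = [random.gauss(0, 1e-3) for _ in range(N)]

l1 = sum(abs(x) for x in g)                 # ‖g‖₁ = sign(g)·g
l2 = math.sqrt(sum(x * x for x in g))       # ‖g‖ (L2 norm)

# Identity from the text: ‖g‖₁ = √N ‖g‖ cos(sign(g), g)
cos = l1 / (math.sqrt(N) * l2)
# For Gaussian components, cos ≈ sqrt(2/π) ≈ 0.798, independent of N

# Keeping ΔL = -η√N‖g‖cos constant: 4× the parameters → halve η
eta_small = 3e-4
eta_large = eta_small / math.sqrt(4)        # learning rate for the 4× model
```

The measured cosine is close to $\sqrt{2/\pi}$ for this Gaussian toy vector, consistent with it being roughly scale-independent.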

Summary

This article presents some of the author’s views and thoughts on the phenomenon of “the default norm length for gradient clipping being 1.”

@online{kexuefm-10657,
        title={Why is the default norm length for gradient clipping 1?},
        author={苏剑林},
        year={2025},
        month={01},
        url={\url{https://kexue.fm/archives/10657}},
}