What Does BN Actually Do? An Analysis From First Principles

This is a gemini-2.5-flash translation of a Chinese article.

It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.

By Su Jianlin | 2019-10-11 | 151859 Readers

BN, short for Batch Normalization, is a rather important technique in current deep learning models (especially vision-related models). It can accelerate training, even has some anti-overfitting effect, and allows us to use larger learning rates. In general, it offers many benefits (provided you can afford a large batch size).

So, how exactly does BN work? Early explanations were primarily based on probability distributions, roughly meaning that it normalizes the input distribution of each layer to $\mathcal{N}(0,1)$, reducing the so-called Internal Covariate Shift, thereby stabilizing and even accelerating training. This explanation seems plausible at first glance, but upon closer inspection, it has issues: the input to any layer can never strictly satisfy a normal distribution, thus simply standardizing the mean and variance cannot achieve the standard distribution $\mathcal{N}(0,1)$; furthermore, even if it could achieve $\mathcal{N}(0,1)$, this interpretation cannot further explain why other normalization methods (such as Instance Normalization, Layer Normalization) work.

In last year’s paper, 《How Does Batch Normalization Help Optimization?》, the authors explicitly raised the above doubts, refuted some of the original views, and proposed their new understanding of BN: they believe that BN’s main role is to make the landscape of the entire loss function smoother, thereby allowing for more stable training.

This blog post mainly shares the conclusions of that paper, but the line of argument was worked out by the author independently, ‘behind closed doors.’ I humbly believe that the original paper’s arguments are too obscure, especially the mathematical parts, making them difficult to understand. Therefore, this article attempts to express the same viewpoint as intuitively as possible.

(Note: Before reading this article, please ensure you clearly understand what BN is. This article will not repeat the introduction of BN’s concept and process.)

Some Basic Conclusions
#

In this section, we will first present a core inequality, then derive gradient descent, and obtain some basic conclusions about model training, laying the groundwork for the subsequent analysis of BN.

Core Inequality
#

Assume that the gradient of function $f(\theta)$ satisfies the Lipschitz constraint (L-constraint), i.e., there exists a constant $L$ such that the following always holds:

$$ \Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert_2\leq L\Vert \Delta\theta\Vert_2 $$

Then we have the following inequality:

$$ f(\theta+\Delta\theta) \leq f(\theta) + \left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle + \frac{1}{2}L \Vert \Delta\theta\Vert_2^2 $$

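The proof is not difficult. Consider the auxiliary function $f(\theta + t\Delta\theta)$ for $t\in[0, 1]$. Integrating its derivative in $t$, then applying the Cauchy-Schwarz inequality and the Lipschitz constraint (with $\Delta\theta$ replaced by $t\Delta\theta$), we directly obtain: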

$$ \begin{aligned}f(\theta + \Delta\theta) - f(\theta)=&\int_0^1\frac{\partial f(\theta + t\Delta\theta)}{\partial t} dt\\ =&\int_0^1\left\langle\nabla_{\theta} f(\theta + t\Delta\theta), \Delta\theta\right\rangle dt\\ =&\left\langle\nabla_{\theta} f(\theta), \Delta\theta\right\rangle + \int_0^1\left\langle\nabla_{\theta} f(\theta + t\Delta\theta) - \nabla_{\theta} f(\theta), \Delta\theta\right\rangle dt\\ \leq&\left\langle\nabla_{\theta} f(\theta), \Delta\theta\right\rangle + \int_0^1\Vert\nabla_{\theta} f(\theta + t\Delta\theta) - \nabla_{\theta} f(\theta)\Vert_2 \cdot \Vert \Delta\theta\Vert_2 dt\\ \leq&\left\langle\nabla_{\theta} f(\theta), \Delta\theta\right\rangle + \int_0^1 L \Vert \Delta\theta\Vert_2^2 t dt\\ = &\left\langle\nabla_{\theta} f(\theta), \Delta\theta\right\rangle + \frac{1}{2} L \Vert \Delta\theta\Vert_2^2 \end{aligned} $$
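As a quick sanity check, the inequality can be tested numerically. The sketch below is only an illustration under assumptions that are not in the original article: it picks the one-dimensional function $f(\theta)=\log(1+e^{-\theta})$, whose second derivative never exceeds $1/4$, so its gradient satisfies the constraint with $L=1/4$. The names `f`, `grad_f` and `worst_gap` are hypothetical.

```python
import math
import random

def f(theta):
    # log(1 + exp(-theta)), written in a numerically stable form.
    return max(-theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def grad_f(theta):
    # f'(theta) = -1 / (1 + exp(theta)); safe for the moderate range used here.
    return -1.0 / (1.0 + math.exp(theta))

L = 0.25  # assumed Lipschitz constant of f': f''(theta) = s(1 - s) <= 1/4

random.seed(0)
worst_gap = float("inf")
for _ in range(10000):
    theta = random.uniform(-5.0, 5.0)
    delta = random.uniform(-3.0, 3.0)
    upper = f(theta) + grad_f(theta) * delta + 0.5 * L * delta ** 2
    # The gap "upper bound minus true value" should never be negative.
    worst_gap = min(worst_gap, upper - f(theta + delta))

print("smallest gap between bound and f(theta + delta):", worst_gap)
```

If the inequality holds, the printed gap is non-negative up to floating-point rounding.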

Gradient Descent
#

Suppose $f(\theta)$ is the loss function, and our goal is to minimize $f(\theta)$. This inequality tells us a lot. First, since we are minimizing, we naturally hope that each step decreases the function, i.e., $f(\theta+\Delta\theta) < f(\theta)$. Since $\frac{1}{2}L \Vert \Delta\theta\Vert_2^2$ is non-negative, the bound can only guarantee a decrease if $\left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle < 0$. A natural choice then is:

$$ \Delta\theta = -\eta \nabla_{\theta}f(\theta) $$

Here, $\eta > 0$ is a scalar, which is the learning rate.

It can be seen that this formula is the update rule for gradient descent. Thus, this is also a derivation of gradient descent, and this derivation carries richer information because it rests on a rigorous inequality rather than an approximation, allowing it to provide us with some conclusions about training.

Lipschitz Constraint
#

Substituting the gradient descent formula into the core inequality, we get:

$$ f(\theta+\Delta\theta) \leq f(\theta) + \left(\frac{1}{2}L\eta^2 - \eta\right) \Vert \nabla_{\theta}f(\theta)\Vert_2^2 $$

Note that a sufficient condition to ensure the loss function decreases (wherever the gradient is nonzero) is $\frac{1}{2}L\eta^2 - \eta < 0$, i.e., $0 < \eta < 2/L$. To achieve this, either $\eta$ must be sufficiently small, or $L$ must be sufficiently small. However, a sufficiently small $\eta$ means learning will be quite slow. Therefore, the more ideal situation is for $L$ to be small enough: lowering $L$ allows for a larger learning rate, which speeds up learning, and this is one of its benefits.

However, $L$ is an intrinsic property of $f(\theta)$, so it can only be reduced by adjusting $f$ itself.
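To make the role of the condition $\frac{1}{2}L\eta^2 - \eta < 0$ (equivalently $\eta < 2/L$) concrete, here is a minimal sketch on a toy problem. The quadratic $f(\theta)=\frac{1}{2}L\theta^2$ and the helper `run_gd` are assumptions chosen for illustration; its gradient $L\theta$ has Lipschitz constant exactly $L$.

```python
def run_gd(eta, L=4.0, theta0=1.0, steps=20):
    """Run gradient descent on f(theta) = 0.5 * L * theta**2 and return the losses."""
    theta = theta0
    losses = [0.5 * L * theta ** 2]
    for _ in range(steps):
        theta -= eta * L * theta  # gradient of f is L * theta
        losses.append(0.5 * L * theta ** 2)
    return losses

# With L = 4 the threshold is 2 / L = 0.5.
stable = run_gd(eta=0.4)             # below the threshold: loss shrinks every step
unstable = run_gd(eta=0.6)           # above the threshold: loss grows every step
smoother = run_gd(eta=0.6, L=1.0)    # same eta, smaller L (threshold 2.0): loss shrinks

print("eta=0.4, L=4:", stable[-1])
print("eta=0.6, L=4:", unstable[-1])
print("eta=0.6, L=1:", smoother[-1])
```

The third run uses the same learning rate that diverged in the second run; only $L$ was lowered, which is the situation the rest of the article tries to engineer.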

How BN Was Made
#

This section will demonstrate how aiming to reduce the $L$-constant of the neural network’s gradient can naturally lead to BN. In other words, BN reduces the $L$-constant of the neural network’s gradient, making neural network learning easier, for instance, by allowing the use of larger learning rates. Intuitively, reducing the $L$-constant of the gradient means making the loss function less ‘volatile’ or making its landscape smoother.

Note: We have previously discussed the $L$-constraint. Earlier, we discussed neural networks satisfying the $L$-constraint with respect to “inputs,” which led to spectral regularization and spectral normalization of weights (please refer to 《Lipschitz Constraint in Deep Learning: Generalization and Generative Models》). This article, however, discusses neural networks (specifically their gradients) satisfying the $L$-constraint with respect to “parameters,” which leads to various input normalization methods, with BN being one of the most natural.

Gradient Analysis
#

Taking supervised learning as an example, suppose the neural network is represented as $\hat{y}=h(x;\theta)$, and the loss function is $l(y,\hat{y})$. Then what we need to do is:

$$ \theta = \mathop{\text{argmin}}_{\theta}\, \mathbb{E}_{(x,y)\sim p(x,y)}\left[l(y, h(x;\theta))\right] $$

That is, $f(\theta)=\mathbb{E}_{(x,y)\sim p(x,y)}\left[l(y, h(x;\theta))\right]$, so:

$$ \begin{aligned}\nabla_{\theta}f(\theta)=&\mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{\theta}l(y, h(x;\theta))\right]\\ =&\mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{h}l(y, h(x;\theta))\nabla_{\theta}h(x;\theta)\right]\end{aligned} $$

By the way, none of the notations in this article are bolded, but depending on the context, they may represent either scalars or vectors.

Non-linear Hypothesis
#

Evidently, $f(\theta)$ is a nonlinear function, and its nonlinearity stems from two sources:

1. The loss function $l(y,\hat{y})$ is generally nonlinear;
2. The activation functions in the neural network $h(x;\theta)$ are nonlinear.

Regarding activation functions, most mainstream activation functions currently satisfy a characteristic: the absolute value of their derivative does not exceed a certain constant. We now consider whether this characteristic can be extended to the loss function, i.e., whether the gradient of the loss function $\nabla_{h}l(y, h(x;\theta))$ (throughout the entire training process) will be confined within a certain range.

Superficially, this assumption usually does not hold. For example, cross-entropy is $-\log p$, and its derivative is $-1/p$, which clearly cannot be constrained within a finite range. However, when the loss function is considered together with the activation function of the last layer, this constraint is usually satisfied. For instance, in binary classification, the last layer typically uses a sigmoid activation. When combined with cross-entropy, this becomes:

$$ -\log \text{sigmoid}(h(x;\theta)) = \log \left(1 + e^{-h(x;\theta)}\right) $$

At this point, its gradient with respect to $h$ is between -1 and 1. Of course, there are indeed some cases where this does not hold. For example, regression problems typically use MSE as the loss function, and the last layer usually does not have an activation function. In this case, its gradient is a linear function and will not be confined within a finite range. Under such circumstances, we can only hope for good model initialization and a good optimizer, such that $\nabla_{h}l(y, h(x;\theta))$ remains relatively stable throughout the training process. This ‘hope’ may seem strong, but neural networks that successfully train generally meet this ‘hope’.
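The contrast between the two cases can be checked numerically. The following sketch is illustrative only; the function names `dloss_dh` and `dloss_dp` are hypothetical. It uses the standard fact that binary cross-entropy composed with a sigmoid has derivative $\text{sigmoid}(h) - y$ with respect to the logit $h$, while the derivative with respect to the probability $p$ is $-y/p + (1-y)/(1-p)$.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dloss_dh(h, y):
    # Gradient of binary cross-entropy w.r.t. the logit h (sigmoid folded in).
    return sigmoid(h) - y

def dloss_dp(p, y):
    # Gradient of binary cross-entropy w.r.t. the probability p (no sigmoid folded in).
    return -y / p + (1 - y) / (1 - p)

# Gradient w.r.t. the logit stays inside [-1, 1] even for extreme logits.
logit_grads = [dloss_dh(h, y) for h in range(-50, 51, 5) for y in (0, 1)]
max_abs_logit_grad = max(abs(g) for g in logit_grads)

# Gradient w.r.t. the probability blows up as p approaches 0 for a positive label.
prob_grad = dloss_dp(1e-6, 1)

print("largest |gradient| w.r.t. logit:", max_abs_logit_grad)
print("gradient w.r.t. probability at p=1e-6:", prob_grad)
```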

Cauchy-Schwarz Inequality
#

Our goal is to explore the extent to which $\nabla_{\theta}f(\theta)$ satisfies the $L$-constraint, and to discuss methods to reduce this $L$. For this purpose, let’s first consider the simplest single-layer neural network (input vector, scalar output) $h(x;w,b)=g\left(\left\langle x, w\right\rangle + b\right)$, where $g$ is the activation function. In this case:

$$ \begin{aligned}\mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{b}f(w,b)\right]&=\mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{\partial l_{w,b}}{\partial g}\dot{g}\left(\left\langle x, w\right\rangle + b\right)\right]\\ \mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{w}f(w,b)\right]&=\mathbb{E}_{(x,y)\sim p(x,y)}\left[\frac{\partial l_{w,b}}{\partial g}\dot{g}\left(\left\langle x, w\right\rangle + b\right)x\right] \end{aligned} $$

Based on our assumption, both $\frac{\partial l_{w,b}}{\partial g}$ and $\dot{g}\left(\left\langle x, w\right\rangle + b\right)$ are bounded. Therefore, the gradient of the bias term $b$ is very stable, and its update should also be very stable. The gradient of $w$ is different: it depends directly on the input $x$.

Regarding the gradient difference for $w$, we have:

$$ \begin{aligned} &\big\Vert\mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{w}f(w+\Delta w,b)\right] - \mathbb{E}_{(x,y)\sim p(x,y)}\left[\nabla_{w}f(w,b)\right]\big\Vert_2\\ =&\Bigg\Vert\mathbb{E}_{(x,y)\sim p(x,y)}\left[\left(\frac{\partial l_{w+\Delta w,b}}{\partial g}\dot{g}\left(\left\langle x, w+\Delta w\right\rangle + b\right) - \frac{\partial l_{w,b}}{\partial g}\dot{g}\left(\left\langle x, w\right\rangle + b\right)\right)x\right]\Bigg\Vert_2 \end{aligned} $$

Let the parenthesized part be denoted as $\lambda(x, y; w,b,\Delta w)$. According to the previous discussion, it is constrained within a certain range, and this part remains a stable term. Given this, we might as well assume it naturally satisfies the $L$-constraint, i.e.:

$$ \Vert\lambda(x, y; w,b,\Delta w)\Vert_2=\mathcal{O}\left(\Vert\Delta w\Vert_2\right) $$

At this point, we only need to focus on the additional $x$. According to the Cauchy-Schwarz inequality, we have:

$$ \begin{aligned}&\Big\Vert\mathbb{E}_{(x,y)\sim p(x,y)}\left[\lambda(x, y; w,b,\Delta w) x\right]\Big\Vert_2\\ \leq & \sqrt{\mathbb{E}_{(x,y)\sim p(x,y)}\left[\lambda(x, y; w,b,\Delta w)^2\right]}\times \sqrt{\big|\mathbb{E}_{x\sim p(x)}\left[x\otimes x\right]\big|_1} \end{aligned} $$

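In this way, we obtain the factor $\big|\mathbb{E}_{x\sim p(x)}\left[x\otimes x\right]\big|_1$, which is independent of the (current layer’s) parameters. Here $x\otimes x$ is read as the element-wise product of $x$ with itself and $|\cdot|_1$ as the sum of the absolute values of the components, so this factor is simply $\mathbb{E}_{x\sim p(x)}\left[\Vert x\Vert_2^2\right]$. If we want to reduce the $L$-constant, the most direct method is to reduce this term.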
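The bound can be checked on sampled data. In the sketch below, $x\otimes x$ is interpreted as the element-wise square and $|\cdot|_1$ as the sum over components, so the second factor is the mean of $\Vert x\Vert_2^2$; this reading, the synthetic Gaussian inputs, and the random scalars standing in for $\lambda$ are all assumptions made for illustration.

```python
import math
import random

random.seed(0)
n, d = 2000, 5

# Synthetic inputs with a nonzero mean, and bounded scalars playing the role of lambda.
xs = [[random.gauss(3.0, 2.0) for _ in range(d)] for _ in range(n)]
lams = [random.uniform(-1.0, 1.0) for _ in range(n)]

# Left-hand side: || E[lambda * x] ||_2
mean_lx = [sum(l * x[j] for l, x in zip(lams, xs)) / n for j in range(d)]
lhs = math.sqrt(sum(v * v for v in mean_lx))

# Right-hand side: sqrt(E[lambda^2]) * sqrt(E[||x||^2])
mean_lam_sq = sum(l * l for l in lams) / n
mean_x_sq = sum(sum(v * v for v in x) for x in xs) / n
rhs = math.sqrt(mean_lam_sq) * math.sqrt(mean_x_sq)

print("lhs:", lhs)
print("rhs:", rhs)
```

The inequality holds exactly for the empirical distribution of any finite sample, so `lhs` never exceeds `rhs` regardless of the seed.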

Subtract Mean and Divide by Standard Deviation
#

It should be noted that while we strongly desire to reduce the $L$-constant of the gradient, there is a prerequisite: it must not significantly reduce the original neural network’s fitting capability. Otherwise, simply multiplying by 0 would reduce $L$ to 0, which would be meaningless.

The result from the previous inequality tells us that finding a way to reduce $\big|\mathbb{E}_{x\sim p(x)}\left[x\otimes x\right]\big|_1$ is a direct approach, which means we need to transform the input $x$. Given the aforementioned prerequisite of ‘not reducing fitting capability,’ the simplest potentially effective method is a translation, i.e., $x \to x - \mu$. In other words, we look for a $\mu$ such that:

$$ \big|\mathbb{E}_{x\sim p(x)}\left[(x-\mu)\otimes (x-\mu)\right]\big|_1 $$

is minimized. This is simply a quadratic function minimization problem, and it’s not difficult to solve for the optimal $\mu$, which is:

$$ \mu = \mathbb{E}_{x\sim p(x)}\left[x\right] $$

This is exactly the mean of all samples. Thus, we get:

Conclusion 1: Subtracting the mean of all samples from the input can reduce the $L$-constant of the gradient, which is an operation beneficial for optimization without reducing the neural network’s fitting capability.
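The minimization behind Conclusion 1 follows from the identity $\mathbb{E}\left[\Vert x-\mu\Vert_2^2\right] = \mathbb{E}\left[\Vert x-\bar{x}\Vert_2^2\right] + \Vert \mu-\bar{x}\Vert_2^2$ with $\bar{x}=\mathbb{E}[x]$, which is smallest when $\mu=\bar{x}$. A small numerical sketch, using synthetic data and a hypothetical helper `second_moment`, illustrates it:

```python
import random

random.seed(0)
n, d = 1000, 3
# Synthetic inputs whose mean is far from zero.
xs = [[random.gauss(5.0, 1.0) for _ in range(d)] for _ in range(n)]

def second_moment(samples, mu):
    # Empirical mean of ||x - mu||^2, i.e. the quantity being minimized over mu.
    total = sum(sum((v - m) ** 2 for v, m in zip(x, mu)) for x in samples)
    return total / len(samples)

mean = [sum(x[j] for x in xs) / n for j in range(d)]

raw = second_moment(xs, [0.0] * d)                    # no shift at all
centered = second_moment(xs, mean)                    # shift by the sample mean
shifted = second_moment(xs, [m + 0.5 for m in mean])  # shift by a slightly wrong mu

print("no shift:      ", raw)
print("mean shift:    ", centered)
print("off-mean shift:", shifted)
```

In this example the off-mean shift exceeds the centered value by exactly $3 \times 0.5^2 = 0.75$ (up to rounding), as the identity predicts.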

Next, we consider a scaling transformation, i.e., $x - \mu \to \frac{x - \mu}{\sigma}$, where $\sigma$ is a vector of the same size as $x$, and the division is element-wise. This leads to:

$$ \big|\mathbb{E}_{x\sim p(x)}\left[(x-\mu)\otimes (x-\mu)\right]\big|_1 \to \left|\frac{\mathbb{E}_{x\sim p(x)}\left[(x-\mu)\otimes (x-\mu)\right]}{\sigma\otimes \sigma}\right|_1 $$

$\sigma$ is the most direct scaling factor for $L$. But the question is, what is the best scaling target? If one blindly pursues a smaller $L$, then simply setting $\sigma \to \infty$ would suffice, but such a neural network would completely lose its fitting capability. However, if $\sigma$ is too small, leading to an excessively large $L$, then it is not conducive to optimization. Therefore, we need a standard.

What makes a good standard? Let’s go back to the gradient expression. As mentioned earlier, the gradient of the bias term is not significantly affected by $x$, so it seems like a reliable standard. If this is the case, then it’s equivalent to scaling the weight of the input $x$ term directly to 1. In other words, $\frac{\mathbb{E}_{x\sim p(x)}\left[(x-\mu)\otimes (x-\mu)\right]}{\sigma\otimes \sigma}$ becomes an all-ones vector, or to put it differently:

$$ \sigma = \sqrt{\mathbb{E}_{x\sim p(x)}\left[(x-\mu)\otimes (x-\mu)\right]} $$

This way, a relatively natural choice is to set $\sigma$ to the standard deviation of the input. At this point, we can see that dividing by the standard deviation acts more like an adaptive learning rate correction term. To some extent, it eliminates the differences in parameter optimization caused by inputs at different layers, making the optimization of the entire network more ‘synchronous,’ or making each layer of the neural network more ‘equal,’ thereby utilizing the entire neural network more fully and reducing the possibility of overfitting at a particular layer. Of course, when the input magnitude is too large, dividing by the standard deviation also helps to reduce the $L$-constant of the gradient.

Thus, we have the conclusion:

Conclusion 2: Dividing the input (after subtracting the mean of all samples) by the standard deviation of all samples has an effect similar to an adaptive learning rate, making updates in each layer more synchronous, reducing the possibility of overfitting in a specific layer, and is an operation that improves neural network performance.
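The ‘synchronizing’ effect can be seen on features with very different scales. The sketch below is an illustration on synthetic data (the scales `0.01`, `1.0`, `100.0` are arbitrary assumptions): before scaling, the per-coordinate second moments differ by many orders of magnitude, and after dividing by the standard deviation each one equals 1.

```python
import math
import random

random.seed(0)
n = 1000
scales = [0.01, 1.0, 100.0]  # hypothetical feature scales
d = len(scales)
xs = [[random.gauss(0.0, s) for s in scales] for _ in range(n)]

# Per-coordinate mean and (population) standard deviation.
mu = [sum(x[j] for x in xs) / n for j in range(d)]
var = [sum((x[j] - mu[j]) ** 2 for x in xs) / n for j in range(d)]
sigma = [math.sqrt(v) for v in var]

# Per-coordinate second moment of the centered input, before and after scaling.
before = var
after = [
    sum(((x[j] - mu[j]) / sigma[j]) ** 2 for x in xs) / n
    for j in range(d)
]

print("before scaling:", before)
print("after scaling: ", after)
```

Since the gradient with respect to each weight is multiplied by the corresponding input coordinate, equalizing these moments puts the weight gradients on a comparable footing with the bias gradient.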

As Deduction Ends, BN Appears
#

Although the previous derivations superficially used only a single-layer neural network (input vector, scalar output) as an example, the conclusions are sufficiently representative, because multi-layer neural networks are essentially just compositions of single-layer neural networks (for this argument, please refer to the author’s earlier work 《From Boosting to Neural Networks: Is a Mountain a Mountain?》).

So, with the previous two conclusions, BN can basically be implemented: during training, simply subtract the mean from the input of each layer and divide by the standard deviation. However, each batch is only an approximation of the whole dataset, while the mean and standard deviation in the derivation are expectations over all samples, so BN inherently performs better with a larger batch size, which places demands on computational power. Furthermore, a conclusion from this analysis is: BN should be placed before fully connected/convolutional layers.

In addition, we need to maintain a set of variables to store the mean and variance calculated during training for use during inference. These are the mean and variance variables statistically gathered through moving averages in BN. As for the $\beta, \gamma$ terms added in the standard BN design after subtracting the mean and dividing by the standard deviation, I believe they are merely icing on the cake, not strictly necessary, so I cannot provide much further explanation.
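Putting the pieces together, the procedure described above can be sketched as a small class. This is a minimal illustration rather than any library's implementation: the class name `SimpleBatchNorm`, the `momentum=0.9` moving-average convention, and the `eps` stabilizer are assumptions, and the $\beta, \gamma$ terms are deliberately left out.

```python
import math

class SimpleBatchNorm:
    """Per-feature normalization with moving-average statistics (no gamma/beta)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.momentum = momentum  # assumed convention: weight kept on the old average
        self.eps = eps            # small constant to avoid division by zero
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features

    def __call__(self, batch, training):
        d = len(self.running_mean)
        if training:
            n = len(batch)
            # Batch statistics approximate the expectations over all samples.
            mean = [sum(row[j] for row in batch) / n for j in range(d)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n for j in range(d)]
            m = self.momentum
            self.running_mean = [m * r + (1 - m) * b for r, b in zip(self.running_mean, mean)]
            self.running_var = [m * r + (1 - m) * b for r, b in zip(self.running_var, var)]
        else:
            # At inference, reuse the statistics accumulated during training.
            mean, var = self.running_mean, self.running_var
        return [
            [(row[j] - mean[j]) / math.sqrt(var[j] + self.eps) for j in range(d)]
            for row in batch
        ]

bn = SimpleBatchNorm(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
train_out = bn(batch, training=True)   # normalized with batch statistics
infer_out = bn(batch, training=False)  # normalized with running statistics
print(train_out[0], bn.running_mean)
```

After a single training step the running statistics are still far from the true ones, so the inference output differs from the training output; they only agree once the moving averages have converged.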

Summary (formatted)
#

This article analyzed the working principle of BN from an optimization perspective. The views presented are generally consistent with 《How Does Batch Normalization Help Optimization?》, but the mathematical argumentation and description style, in my personal opinion, are simpler and easier to understand. The ultimate conclusion is that the term involving subtracting the mean helps to reduce the $L$-constant of the neural network’s gradient, while the term involving dividing by the standard deviation acts more like an adaptive learning rate, making the updates of each parameter more synchronous, and preventing overfitting to a specific layer or parameter.

Of course, the above interpretation is just a rough guide. Fully explaining BN is a very difficult task. BN’s effect is more like a composite result of various factors. For example, for most mainstream activation functions, [-1, 1] is generally a region with strong nonlinearity. Therefore, making the input have a mean of 0 and a variance of 1 can also more fully leverage the activation function’s nonlinear capabilities, preventing an undue waste of the neural network’s fitting capacity.

In summary, the theoretical analysis of neural networks is a very difficult undertaking, far beyond the author’s capabilities. I can only write a blog here, telling some perhaps inconsequential stories to amuse everyone.

@online{kexuefm-6992,
        title={What Does BN Actually Do? An Analysis From First Principles},
        author={苏剑林},
        year={2019},
        month={10},
        url={\url{https://kexue.fm/archives/6992}},
}