
Understanding Model Parameter Initialization Strategies from a Geometric Perspective


This is a gemini-2.5-flash translation of a Chinese article.


By Su Jianlin | 2020-01-16 | 135,259 Readers

For complex models, parameter initialization is particularly important. Poor initialization often leads not just to degraded performance, but to a model that cannot train at all or fails to converge. In deep learning, a common adaptive initialization strategy is Xavier initialization, which samples initial weights from a normal distribution $\mathcal{N}\left(0,\frac{2}{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}\right)$, where $\text{fan}_{\text{in}}$ is the input dimension and $\text{fan}_{\text{out}}$ is the output dimension. Other initialization strategies are largely similar, differing only in their assumptions, which leads to slight variations in their final forms.

The derivation of standard initialization strategies is based on probability and statistics, with the general idea being to assume input data has a mean of 0 and variance of 1, then expect output data to also maintain a mean of 0 and variance of 1, and subsequently derive the mean and variance conditions that the initial transformation should satisfy. This process is theoretically sound, but in my opinion, it’s still not intuitive enough, and the derivation involves quite a few assumptions. This article aims to understand model initialization methods from a geometric perspective, providing a more intuitive derivation process.

Readily Available Orthogonality

Previously, I wrote “The Distribution of Angles Between Two Random Vectors in n-Dimensional Space”, one of the corollaries of which is:

Corollary 1: Any two random vectors in high-dimensional space are almost perpendicular.
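Corollary 1 is easy to check numerically. The sketch below samples two Gaussian vectors in a high-dimensional space and measures the cosine of the angle between them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000  # dimension

# Two independent random vectors from the standard normal distribution
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Cosine of the angle between them; for random directions this
# concentrates around 0 with spread on the order of 1/sqrt(n)
cos_angle = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_angle)  # close to 0, i.e. the vectors are nearly perpendicular
```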

In fact, Corollary 1 is the starting point for the entire geometric perspective presented in this article! A further corollary of it is:

Corollary 2: Randomly selecting $n^2$ numbers from $\mathcal{N}(0, 1/n)$ to form an $n\times n$ matrix results in a matrix that is approximately orthogonal, with the approximation improving as $n$ increases.

Skeptical readers can also verify this numerically:

```python
import numpy as np

n = 100
W = np.random.randn(n, n) / np.sqrt(n)  # entries effectively from N(0, 1/n)
X = np.dot(W.T, W)  # the matrix multiplied by its own transpose
print(X)  # check whether it is close to the identity matrix
print(np.square(X - np.eye(n)).mean())  # MSE against the identity matrix
```

I believe that for most readers, seeing Corollary 2 for the first time will be more or less surprising. An orthogonal matrix is a matrix $\boldsymbol{W}$ satisfying $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}$, meaning its inverse equals its transpose. Computing the inverse of a general matrix is incomparably harder than computing its transpose, so we tend to feel that “inverse = transpose” should be a very strict condition. However, Corollary 2 tells us that a matrix obtained by random sampling is already close to an orthogonal matrix, which is somewhat counter-intuitive. When I first realized this, I was quite surprised too.

Actually, It’s Not That Hard to Understand

However, once we get used to the fact stated in Corollary 1, that any two random vectors in high-dimensional space are almost perpendicular, we can quickly understand and derive this result. For a quick derivation, first consider the standard normal distribution $\mathcal{N}(0,1)$; note that Corollary 1 requires the sampling direction to be uniform, which the standard normal distribution satisfies. An $n\times n$ matrix sampled from $\mathcal{N}(0,1)$ can be viewed as $n$ vectors in $n$-dimensional space. Since these $n$ vectors are random vectors, they are naturally nearly orthogonal to each other.

Of course, pairwise orthogonality alone does not make an orthogonal matrix, because an orthogonal matrix also requires each vector to have a length (magnitude) of 1. We know that $\mathbb{E}_{x\sim \mathcal{N}(0,1)}\left[x^2\right]=1$, which means that an $n$-dimensional vector sampled from $\mathcal{N}(0,1)$ has a magnitude of approximately $\sqrt{n}$. Therefore, to approach orthogonality, each element must be divided by $\sqrt{n}$, which is equivalent to changing the sampling variance from 1 to $1/n$.

Furthermore, the sampling distribution does not necessarily have to be a normal distribution; for example, a uniform distribution $U\left[-\sqrt{3/n}, \sqrt{3/n}\right]$ also works. In fact, we have:

Corollary 3: An $n\times n$ matrix, independently sampled from any distribution $p(x)$ with mean 0 and variance $1/n$, will be approximately an orthogonal matrix.
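Corollary 3 can also be checked numerically. Here is a small sketch using the uniform distribution $U\left[-\sqrt{3/n}, \sqrt{3/n}\right]$ mentioned above, which has mean 0 and variance $1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# U[-a, a] has mean 0 and variance a^2/3; choosing a = sqrt(3/n)
# gives variance exactly 1/n
a = np.sqrt(3.0 / n)
W = rng.uniform(-a, a, size=(n, n))

# Measure how far W^T W is from the identity matrix
mse = np.square(W.T @ W - np.eye(n)).mean()
print(mse)  # small, and it shrinks further as n grows
```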

We can understand Corollary 3 from a more mathematical perspective: Assume $\boldsymbol{x}=(x_1,x_2,\dots,x_n)$ and $\boldsymbol{y}=(y_1,y_2,\dots,y_n)$ are both sampled from $p(x)$. Then we have:

$$ \begin{aligned} \langle \boldsymbol{x}, \boldsymbol{y}\rangle =&\, n\times \frac{1}{n}\sum_{k=1}^n x_k y_k\\ \approx&\, n\times \mathbb{E}_{x\sim p(x),y\sim p(x)}[xy]\\ =&\, n\times \mathbb{E}_{x\sim p(x)}[x]\times \mathbb{E}_{y\sim p(x)}[y]\\ =&\,0 \end{aligned} $$

and

$$ \begin{aligned} \Vert\boldsymbol{x}\Vert^2 =&\, n\times \frac{1}{n}\sum_{k=1}^n x_k^2\\ \approx&\, n\times \mathbb{E}_{x\sim p(x)}\left[x^2\right]\\ =&\, n\times \left(\mu^2 + \sigma^2\right)\\ =&\,1 \end{aligned} $$

Thus, any two vectors are approximately orthonormal, and therefore, the sampled matrix is also approximately an orthogonal matrix.
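The two approximations above can be verified directly (a small NumPy sketch): sampling two vectors entrywise with variance $1/n$, the inner product is close to 0 and the squared magnitude is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Two vectors sampled entrywise from N(0, 1/n)
x = rng.normal(0.0, 1.0 / np.sqrt(n), n)
y = rng.normal(0.0, 1.0 / np.sqrt(n), n)

print(np.dot(x, y))  # approx 0: the vectors are nearly orthogonal
print(np.dot(x, x))  # approx 1: the squared magnitude is nearly 1
```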

Now We Can Talk About Initialization

Having discussed so much about orthogonal matrices, essentially, it all serves as a foundation for understanding the geometric meaning of initialization methods. If readers still recall linear algebra, they should remember that the significant importance of orthogonal matrices lies in their ability to preserve the magnitude (length) of vectors during transformations. Expressed mathematically, let $\boldsymbol{W}\in \mathbb{R}^{n\times n}$ be an orthogonal matrix, and $\boldsymbol{x}\in\mathbb{R}^n$ be an arbitrary vector. Then the magnitude of $\boldsymbol{x}$ is equal to the magnitude of $\boldsymbol{W}\boldsymbol{x}$:

$$ \Vert\boldsymbol{W}\boldsymbol{x}\Vert^2 = \boldsymbol{x}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{x}=\boldsymbol{x}^{\top}\boldsymbol{x}=\Vert\boldsymbol{x}\Vert^2 $$
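This magnitude-preserving property is easy to confirm numerically. The sketch below builds an exactly orthogonal matrix via QR decomposition of a random matrix and compares $\Vert\boldsymbol{x}\Vert$ with $\Vert\boldsymbol{W}\boldsymbol{x}\Vert$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# QR decomposition of a random matrix yields an exactly orthogonal Q
W, _ = np.linalg.qr(rng.standard_normal((n, n)))

x = rng.standard_normal(n)
print(np.linalg.norm(x))      # |x|
print(np.linalg.norm(W @ x))  # |Wx|, equal to |x| up to floating-point error
```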

Consider a fully connected layer:

$$ \boldsymbol{y}=\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b} $$

Deep learning models are essentially nested fully connected layers. Therefore, to prevent the model’s final output from becoming overly “inflated” or “degraded” during the initialization phase, one idea is to ensure that the model preserves vector magnitudes upon initialization.

This idea naturally leads to an initialization strategy: “initialize $\boldsymbol{b}$ with all zeros, and initialize $\boldsymbol{W}$ with a random orthogonal matrix”. And Corollary 2 has already shown us that an $n\times n$ matrix sampled from $\mathcal{N}(0, 1/n)$ is already close to an orthogonal matrix. Thus, we can sample from $\mathcal{N}(0, 1/n)$ to initialize $\boldsymbol{W}$. This is the Xavier initialization strategy, also called Glorot initialization in some frameworks, named after its author, Xavier Glorot. Furthermore, the sampling distribution doesn’t necessarily have to be $\mathcal{N}(0, 1/n)$; as mentioned in Corollary 3, you can sample from any distribution with mean 0 and variance $1/n$.
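As a concrete sketch of this strategy (the helper name `init_dense` is hypothetical, not an API from any framework): sample $\boldsymbol{W}$ from a mean-0, variance-$1/n$ distribution and set $\boldsymbol{b}$ to zeros, then check that the initialized layer approximately preserves vector magnitudes.

```python
import numpy as np

def init_dense(n, rng, dist="normal"):
    # n x n weight matrix with mean 0 and variance 1/n, per Corollary 3;
    # `init_dense` is a hypothetical name used only for this sketch
    if dist == "normal":
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
    else:  # uniform distribution with the same mean and variance
        a = np.sqrt(3.0 / n)
        W = rng.uniform(-a, a, size=(n, n))
    b = np.zeros(n)  # bias initialized to all zeros
    return W, b

# The initialized layer approximately preserves vector magnitudes
rng = np.random.default_rng(0)
W, b = init_dense(1000, rng)
x = rng.standard_normal(1000)
print(np.linalg.norm(W @ x + b) / np.linalg.norm(x))  # close to 1
```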

The above discusses the case where both input and output dimensions are $n$. What if the input is $n$-dimensional and the output is $m$-dimensional? In this case, $\boldsymbol{W}\in\mathbb{R}^{m\times n}$, and the condition for preserving the magnitude of $\boldsymbol{W}\boldsymbol{x}$ is still $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}$. However, this is impossible when $m < n$; when $m \geq n$, it is possible, and based on similar derivations as before, we can obtain:

Corollary 4: When $m \geq n$, an $m\times n$ matrix, independently sampled from any distribution $p(x)$ with mean 0 and variance $1/m$, approximately satisfies $\boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I}$.

Therefore, if $m > n$, we simply need to change the variance of the sampling distribution to $1/m$. As for $m < n$, although there’s no direct derivation, this approach can still be followed, as a reasonable strategy should be generalizable. Note that this modification differs slightly from the original design of Xavier initialization. It is a dual version of “LeCun initialization” (LeCun initialization has a variance of $1/n$), while Xavier initialization uses a variance of $2/(m+n)$, which averages the intuitive approaches for forward and backward propagation. Here, we primarily consider forward propagation.
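Corollary 4 can be checked the same way as before: with entries of variance $1/m$, the $n$ columns of an $m\times n$ matrix are nearly orthonormal $m$-dimensional vectors, so $\boldsymbol{W}^{\top}\boldsymbol{W}$ is close to the $n\times n$ identity.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 100  # requires m >= n

# m x n matrix with entries from N(0, 1/m): each column is an
# m-dimensional vector with squared magnitude approximately 1
W = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
mse = np.square(W.T @ W - np.eye(n)).mean()
print(mse)  # close to 0, so W^T W is approximately the identity
```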

Some readers might still wonder: you’ve only considered scenarios without activation functions. Even if the magnitude of $\boldsymbol{y}$ is the same as $\boldsymbol{x}$, it will be different after $\boldsymbol{y}$ passes through an activation function. This is indeed the case, and at this point, one can only analyze specific problems individually. For example, $\tanh(x)\approx x$ when $x$ is small, so Xavier initialization can be considered directly applicable to $\tanh$ activation. Another example is $\text{relu}$: approximately half of the elements in $\text{relu}(\boldsymbol{y})$ will be set to zero, so the magnitude becomes roughly $1/\sqrt{2}$ of its original value. To preserve the magnitude, $\boldsymbol{W}$ can be multiplied by $\sqrt{2}$, meaning the initialization variance changes from $1/m$ to $2/m$. This is the initialization strategy for $\text{relu}$ proposed by the great Kaiming He.
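The $\text{relu}$ argument above can be verified numerically: with variance $1/m$ the squared magnitude after $\text{relu}$ drops to about half, and scaling the variance to $2/m$ (He initialization) restores it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 1000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)  # unit-magnitude input

# Variance 1/m: relu zeroes out roughly half the output entries,
# halving the squared magnitude
W1 = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
sq1 = np.linalg.norm(np.maximum(W1 @ x, 0)) ** 2
print(sq1)  # approx 0.5

# He initialization: variance 2/m restores the magnitude after relu
W2 = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
sq2 = np.linalg.norm(np.maximum(W2 @ x, 0)) ** 2
print(sq2)  # approx 1.0
```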

Of course, it’s practically difficult to fine-tune the variance for every activation function. Therefore, a more general approach is to directly add a Layer Normalization-like operation after the activation function to explicitly restore the magnitude. This is where various Normalization techniques come into play. (Feel free to continue reading my previous work: “What Exactly Does Batch Normalization Do? An Analysis Behind Closed Doors”.)
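A minimal sketch of such a magnitude-restoring step is below; `restore_magnitude` is a hypothetical name, and real Layer Normalization additionally subtracts the mean and applies learnable scale and shift parameters.

```python
import numpy as np

def restore_magnitude(h, eps=1e-6):
    # Rescale activations back to unit root-mean-square magnitude;
    # `restore_magnitude` is a hypothetical name for this sketch
    rms = np.sqrt(np.mean(np.square(h)) + eps)
    return h / rms

# After an activation function shrinks the magnitude,
# the rescaling restores it
h = np.tanh(np.random.default_rng(0).standard_normal(1000) * 3)
out = restore_magnitude(h)
print(np.mean(np.square(out)))  # approx 1
```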

Summary

This article mainly derives the conclusion that “any $n\times n$ matrix with mean 0 and variance $1/n$ is approximately an orthogonal matrix” from the premise that “any two random vectors in high-dimensional space are almost perpendicular”, thereby offering a geometric perspective on related initialization strategies. I humbly believe this geometric perspective is more intuitive and easier to understand than a purely statistical one.

@online{kexuefm-7180,
        title={Understanding Model Parameter Initialization Strategies from a Geometric Perspective},
        author={苏剑林},
        year={2020},
        month={01},
        url={\url{https://kexue.fm/archives/7180}},
}