
Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods


This is a gemini-2.5-flash translation of a Chinese article.


In articles such as “Understanding Model Parameter Initialization Strategies from a Geometric Perspective” and “Brief Discussion on Transformer Initialization, Parameterization, and Normalization”, we have discussed model initialization methods. The general idea is that if an $n\times n$ square matrix is initialized with independent and identically distributed (i.i.d.) values with a mean of 0 and a variance of $1/n$, it approximates an orthogonal matrix, allowing the data’s second moment (or variance) to roughly remain constant during propagation.

What if it’s an $m\times n$ non-square matrix? The common approach (Xavier initialization) is to consider both forward and backward propagation comprehensively, hence initializing with i.i.d. values with a mean of 0 and a variance of $2/(m+n)$. However, this averaging is more of a ‘rule of thumb’; this article will explore whether there are better averaging schemes.

Basic Review

Xavier initialization considers the following fully connected layer (assume input node count is $m$, output node count is $n$):

$$ y_j = b_j + \sum_i x_i w_{i,j} $$

Here, $b_j$ is generally initialized to 0, and the initialization mean of $w_{i,j}$ is also typically 0. In “Brief Discussion on Transformer Initialization, Parameterization, and Normalization”, we have already calculated that:

$$ \mathbb{E}[y_j^2] = \sum_{i} \mathbb{E}[x_i^2] \mathbb{E}[w_{i,j}^2]= m\mathbb{E}[x_i^2]\mathbb{E}[w_{i,j}^2] $$

So, to keep the second moment constant, we set the initialization variance of $w_{i,j}$ to $1/m$ (when the mean is 0, the variance equals the second moment).
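As a quick numerical sanity check (a NumPy sketch with hypothetical layer sizes, not code from the original article), we can confirm that variance-$1/m$ initialization roughly preserves the second moment in the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 784, 256  # hypothetical input/output dimensions

# Input with roughly unit second moment; weights i.i.d. with mean 0, variance 1/m.
x = rng.normal(size=m)
W = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))
y = x @ W

# E[y_j^2] = m * E[x_i^2] * (1/m) = E[x_i^2], so both printed values should be close to 1.
print(np.mean(x**2), np.mean(y**2))
```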

However, this derivation only considers forward propagation. We also need to ensure the model has reasonable gradients, which means the model must remain stable during backward propagation as well. Assuming the model’s loss function is $l$, according to the chain rule we have:

$$ \frac{\partial l}{\partial x_i} = \sum_j \frac{\partial l}{\partial y_j} \frac{\partial y_j}{\partial x_i}=\sum_j \frac{\partial l}{\partial y_j} w_{i,j} $$

Note that here we sum over $j$, and the dimension of summation is $n$. So, under the same assumptions, we have:

$$ \mathbb{E}\left[\left(\frac{\partial l}{\partial x_i}\right)^2\right] = \sum_{j} \mathbb{E}\left[\left(\frac{\partial l}{\partial y_j}\right)^2\right] \mathbb{E}[w_{i,j}^2]= n \mathbb{E}\left[\left(\frac{\partial l}{\partial y_j}\right)^2\right]\mathbb{E}[w_{i,j}^2] $$

Therefore, to keep the second moment constant during backward propagation, we set the initialization variance of $w_{i,j}$ to $1/n$.
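Symmetrically, variance-$1/n$ initialization keeps the gradient's second moment stable in the backward pass; a minimal sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 256, 784  # hypothetical input/output dimensions, m != n

# Upstream gradient dl/dy with roughly unit second moment; weights with variance 1/n.
grad_y = rng.normal(size=n)
W = rng.normal(scale=np.sqrt(1.0 / n), size=(m, n))

# dl/dx_i = sum_j (dl/dy_j) * w_{i,j}
grad_x = W @ grad_y

# E[(dl/dx_i)^2] = n * E[(dl/dy_j)^2] * (1/n), so both printed values should be close to 1.
print(np.mean(grad_y**2), np.mean(grad_x**2))
```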

One is $1/m$, the other is $1/n$. When $m\neq n$, there is a conflict. However, both are equally important, so Xavier initialization directly averages the two dimensions, initializing with a variance of $2/(m+n)$.

Geometric Mean

Now let’s consider the composition of two fully connected layers (ignoring bias terms for now):

$$ y = xW_1 W_2 $$

where $x\in\mathbb{R}^m,W_1\in\mathbb{R}^{m\times n},W_2\in\mathbb{R}^{n\times m}$. In other words, the input is $m$-dimensional, is transformed to $n$ dimensions, and is then transformed back to $m$ dimensions. The FFN layer in BERT performs a similar operation (though with an activation function in between).

According to forward propagation stability, we should initialize $W_1$ with a variance of $1/m$ and $W_2$ with a variance of $1/n$. But what if we require $W_1$ and $W_2$ to be initialized with the same variance? Then, it’s clear that to ensure the variance of $x,y$ remains unchanged, both $W_1$ and $W_2$ need to be initialized with a distribution having a variance of $1/\sqrt{mn}$. If we consider backward propagation, the result is the same.

Thus, we arrive at a new dimension averaging strategy: the geometric mean $\sqrt{mn}$. With this strategy, when multiple layers are composed and the overall input and output dimensions match, the variance stays constant (in both forward and backward propagation). With the arithmetic mean $(m+n)/2$, by contrast, since $(m+n)^2/4 > mn$ whenever $m\neq n$, the variance shrinks during forward/backward propagation.
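To see the difference numerically, here is a small NumPy sketch (the dimensions are hypothetical) comparing the geometric-mean variance $1/\sqrt{mn}$ against Xavier's arithmetic-mean variance $2/(m+n)$ on the two-layer composition $y = xW_1W_2$:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 64, 1024  # hypothetical: m-dim input, wide n-dim hidden layer

x = rng.normal(size=m)
x /= np.sqrt(np.mean(x**2))  # normalize so the input second moment is exactly 1

def second_moment_after(var, trials=100):
    """Average second moment of y = x @ W1 @ W2 when both W1 and W2
    are initialized i.i.d. with the given variance."""
    total = 0.0
    for _ in range(trials):
        W1 = rng.normal(scale=np.sqrt(var), size=(m, n))
        W2 = rng.normal(scale=np.sqrt(var), size=(n, m))
        total += np.mean((x @ W1 @ W2) ** 2)
    return total / trials

geo = second_moment_after(1.0 / np.sqrt(m * n))  # geometric mean: preserved, ~1
ari = second_moment_after(2.0 / (m + n))         # arithmetic mean: shrinks by 4mn/(m+n)^2
print(geo, ari)
```

With $m=64,n=1024$, the arithmetic-mean variance multiplies the second moment by $4mn/(m+n)^2\approx 0.22$ per such block, while the geometric-mean variance leaves it essentially unchanged.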

Quadratic Mean

Another perspective is to treat this as a joint minimization problem: assuming the chosen variance is $t$, we want $(mt-1)^2$ to be as small as possible during forward propagation, and $(nt-1)^2$ to be as small as possible during backward propagation. Considering both together, we minimize:

$$ (mt-1)^2 + (nt-1)^2 $$

Setting the derivative $2m(mt-1)+2n(nt-1)$ to zero shows that the expression above attains its minimum at $t=(m+n)/(m^2+n^2)$. This gives a quadratic-mean-style averaging scheme: $(m^2+n^2)/(m+n)$.
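A small check of this optimum (again a sketch with hypothetical dimensions):

```python
import numpy as np

m, n = 64, 1024  # hypothetical dimensions

def forward_backward_loss(t):
    # Deviation of the second-moment gain from 1 in the forward (m*t)
    # and backward (n*t) passes.
    return (m * t - 1) ** 2 + (n * t - 1) ** 2

# Setting the derivative 2m(mt-1) + 2n(nt-1) to zero gives:
t_star = (m + n) / (m**2 + n**2)

# The corresponding "averaged dimension" is 1/t_star = (m^2 + n^2)/(m + n).
print(t_star, 1.0 / t_star)
```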

It can be easily proven that:

$$ \frac{m^2+n^2}{m+n} \geq \frac{m+n}{2}\geq \sqrt{mn} $$
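One way to verify the chain is via two elementary identities:

$$ \frac{m^2+n^2}{m+n}-\frac{m+n}{2}=\frac{(m-n)^2}{2(m+n)}\geq 0,\qquad \frac{m+n}{2}-\sqrt{mn}=\frac{(\sqrt{m}-\sqrt{n})^2}{2}\geq 0 $$

with equality throughout if and only if $m=n$.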

From the derivation process, the quadratic mean on the left aims to keep the variance as constant as possible at each step of forward and backward propagation, so it can be considered a local optimum solution; the geometric mean on the right aims to keep the variance of the “initial input” and “final output” as constant as possible, so it can be considered a global optimum solution in a sense; and the arithmetic mean in the middle is a solution that lies between the global and local optima.

Given this, it seems that Xavier initialization’s ‘rule of thumb’ arithmetic mean might not be a bad choice after all, perhaps representing a ‘middle ground’?

Summary

This article briefly explored dimension averaging schemes for non-square matrices in initialization methods. For a long time, the default arithmetic mean seemed to go unquestioned; here, however, I derived alternative averaging strategies from two different perspectives. As for which strategy is better, I haven’t run detailed experiments; interested readers are welcome to try them out. Of course, it’s also possible that, given the many optimization tricks available today, the default initialization scheme already works well enough that such fine-tuning is unnecessary.

@online{kexuefm-8725,
        title={Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods},
        author={苏剑林},
        year={2021},
        month={10},
        url={\url{https://kexue.fm/archives/8725}},
}