By Su Jianlin | 2025-10-08 | 439 readers
For researchers committed to the discretization approach, VQ (Vector Quantization) is a critical component in visual understanding and generation, serving as the "Tokenizer" of vision. It was proposed in the 2017 paper "Neural Discrete Representation Learning", and I also introduced it in my 2019 blog post "A Concise Introduction to VQ-VAE: Quantized Autoencoders".
However, after all these years, VQ training techniques have remained almost unchanged: they all rely on STE (Straight-Through Estimator) plus additional Auxiliary Losses. STE itself is not the problem; it can be regarded as the standard way to design gradients for discrete operations. But the Auxiliary Losses always leave the feeling that training is not entirely end-to-end, and they introduce extra hyperparameters to tune.
Fortunately, this situation may be coming to an end. Last week's paper "DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick" proposed a new STE trick whose biggest highlight is that it requires no Auxiliary Losses, making it exceptionally concise and elegant!
Discrete Encoding#
As usual, let’s first review the existing VQ training schemes. It should be noted that VQ (Vector Quantization) itself is actually a very old concept, traceable back to the 1980s, originally intended to cluster vectors and replace them with their corresponding cluster centers, thereby achieving data compression.
However, the VQ we are discussing here primarily refers to the VQ in VQ-VAE, as proposed in the paper "Neural Discrete Representation Learning". Of course, the definition of VQ itself hasn't changed; it's still a mapping from vectors to cluster centers. The core of the VQ-VAE paper is to provide an end-to-end training scheme for performing VQ on latent variables and then decoding for reconstruction. The difficulty lies in the VQ step being a discrete operation, which lacks readily available gradients and thus requires designing gradients for it.
In formulaic terms, a standard AE (AutoEncoder) is:
$$ z = encoder(x),\quad \hat{x}=decoder(z),\quad \mathcal{L}=\Vert x - \hat{x}\Vert^2 $$

where $x$ is the original input, $z$ is the encoded vector, and $\hat{x}$ is the reconstruction. Based on the idea of VQ, what VQ-VAE does is map $z$ to one of the entries of the codebook $E=\{e_1,e_2,\cdots,e_K\}$:
$$ q = \argmin_{e\in E} \Vert z - e\Vert $$

The $decoder$ then uses $q$ as input for reconstruction. Since $q$ corresponds one-to-one with an index into the codebook, $q$ is effectively an integer encoding of $x$. Of course, to ensure reconstruction quality, in practice the input is not encoded into a single vector but into multiple vectors, which after VQ become multiple integers. So what VQ-VAE does is encode the input into a sequence of integers, analogous to a text tokenizer.
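As a concrete reference, the lookup above is just a nearest-neighbor search over the codebook. Here is a minimal PyTorch sketch (the function name is mine, not from the paper):

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codebook lookup: map each row of z to its closest entry.

    z:        (n, d) encoder outputs
    codebook: (K, d) entries e_1, ..., e_K
    Returns the integer codes (n,) and the quantized vectors q.
    """
    dists = torch.cdist(z, codebook)     # (n, K) pairwise Euclidean distances
    indices = dists.argmin(dim=1)        # q = argmin_e ||z - e||
    return indices, codebook[indices]
```

The returned `indices` are exactly the integer sequence that plays the role of "tokens".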
Gradient Design#
Now, the modules we need to train include the $encoder$, $decoder$, and the codebook $E$. Because the VQ operation involves an $\argmin$ calculation, the gradient path breaks at $q$, preventing it from being propagated back to the $encoder$.
VQ-VAE uses a technique called STE, which states that what is fed into the $decoder$ is still the post-VQ $q$, but during backpropagation to compute gradients, we use the pre-VQ $z$. This allows gradients to be propagated back to the $encoder$, which can be implemented using the stop_gradient operator ($\sg$):
$$ z = encoder(x),\quad q = \argmin_{e\in E} \Vert z - e\Vert,\quad z_q = z + \sg[q - z],\quad \hat{x} = decoder(z_q) $$

Simply put, STE achieves $z_q=q$ in the forward pass but $\nabla z_q = \nabla z$ in the backward pass. This gives the $encoder$ gradients, but $q$ itself receives none, so the codebook cannot be optimized. To solve this problem, VQ-VAE adds two Auxiliary Losses:
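In code, the STE step is the familiar one-line detach trick; a minimal sketch, assuming PyTorch:

```python
import torch

def ste_quantize(z, codebook):
    """Straight-through VQ: forward passes q, backward pretends z_q = z."""
    q = codebook[torch.cdist(z, codebook).argmin(dim=1)]
    # Forward: z_q equals q exactly.  Backward: the bracketed term is
    # detached from the graph, so the decoder's gradient flows entirely
    # into z -- and, as noted above, the codebook receives no gradient.
    return z + (q - z).detach()
```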
$$ \mathcal{L} = \Vert x - \hat{x}\Vert^2 + \beta\Vert q - \sg[z]\Vert^2 + \gamma\Vert z - \sg[q]\Vert^2 $$

The two extra terms pull $q$ toward $z$ and $z$ toward $q$ respectively, consistent with the original idea of VQ. STE plus these two Auxiliary Losses forms the standard VQ-VAE. There is also a simple variant that sets $\beta=0$ and instead updates the codebook with an exponential moving average of $z$; this is equivalent to optimizing the Auxiliary Loss for $q$ with a dedicated SGD optimizer.
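Putting the pieces together, the full objective can be sketched as follows (PyTorch; the default weights are assumptions for illustration, not values from the paper):

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z, q, beta=0.25, gamma=0.25):
    """Total VQ-VAE loss: reconstruction + the two Auxiliary Losses.

    beta  weights ||q - sg[z]||^2 (pulls the codebook entry toward z),
    gamma weights ||z - sg[q]||^2 (commits the encoder output to q).
    """
    recon = F.mse_loss(x_hat, x)
    codebook_term = F.mse_loss(q, z.detach())   # only updates the codebook
    commit_term = F.mse_loss(z, q.detach())     # only updates the encoder
    return recon + beta * codebook_term + gamma * commit_term
```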
It's worth noting that although the original paper named it "VQ-VAE", it is essentially an AE, so "VQ-AE" would in principle be more appropriate. Since the name has stuck, however, we will continue to use it. The later VQGAN built on VQ-VAE by stacking techniques such as a GAN loss to improve reconstruction sharpness.
Alternative Approaches#
For me, these two additional Auxiliary Losses have always been a nuisance. Presumably many in the field feel the same, as related improvement attempts appear from time to time.
Among them, the most "radical" approach is to switch to a discretization scheme other than VQ, such as the FSQ introduced in "Embarrassingly Simple FSQ: 'Rounding' Surpasses VQ-VAE", which needs no Auxiliary Losses. If VQ clusters high-dimensional vectors, then FSQ achieves discretization by "rounding" low-dimensional vectors. However, as I noted in that article, FSQ cannot replace VQ in every scenario, so improving VQ itself remains valuable.
Before proposing DiVeQ, the original author had actually proposed a scheme called “NSVQ”, taking a small step towards “abolishing” Auxiliary Losses. It modifies $z_q$ to:
$$ z_q = z + \Vert q - z\Vert \times \frac{\varepsilon}{\Vert \varepsilon\Vert},\qquad \varepsilon\sim\mathcal{N}(0, I) $$

Here $\varepsilon$ is a vector of the same size as $z$ and $q$, with components drawn from a standard normal distribution. With this new $z_q$, the differentiability of $\Vert q - z\Vert$ means $q$ also receives gradients, so in principle the codebook can be trained without Auxiliary Losses. The geometric meaning of NSVQ is intuitive: it samples uniformly on the sphere centered at $z$ with radius $\Vert q-z\Vert$. The drawback is that the $decoder$'s input is no longer $q$, while at inference time we care about the reconstruction quality of $q$; NSVQ therefore has a training-inference inconsistency.
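The NSVQ substitution is a few lines of code; a minimal PyTorch sketch (assuming $q \ne z$ so the norms are nonzero):

```python
import torch

def nsvq(z, q):
    """NSVQ surrogate: z_q = z + ||q - z|| * eps / ||eps||, eps ~ N(0, I).

    ||q - z|| stays in the autograd graph, so both z and the codebook
    (through q) receive gradients -- but the decoder no longer sees q,
    hence the training-inference inconsistency noted above.
    """
    eps = torch.randn_like(z)
    radius = (q - z).norm(dim=-1, keepdim=True)   # differentiable
    return z + radius * eps / eps.norm(dim=-1, keepdim=True)
```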
Introducing the Protagonist#
Starting from NSVQ, if we want to maintain $q$ for the forward pass while retaining the gradients provided by $\Vert q - z\Vert$, then an improved version can be easily proposed:
$$ z_q = z + \Vert q - z\Vert \times \sg\left[\frac{q - z}{\Vert q - z\Vert}\right] $$

In the forward pass it holds strictly that $z_q = q$, while the backward pass retains the gradients of $z$ and of $\Vert q - z\Vert$. This is the "DiVeQ-detach" from the appendix of the DiVeQ paper; the DiVeQ of the main text can be viewed as an interpolation between DiVeQ-detach and NSVQ:
$$ z_q = z + \Vert q - z\Vert \times \sg\left[\frac{q - z + \varepsilon}{\Vert q - z + \varepsilon\Vert}\right],\qquad \varepsilon\sim\mathcal{N}(0, \sigma^2 I) $$

Clearly, $\sigma=0$ recovers DiVeQ-detach, while $\sigma\to\infty$ recovers NSVQ. The paper's appendix searches over $\sigma$ and concludes that $\sigma^2 = 10^{-3}$ is a generally good choice.
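Both variants fit in one function; a minimal PyTorch sketch (assuming $q \ne z$ and nonzero noisy direction, so the norms are safe), where `sigma=0` gives DiVeQ-detach:

```python
import torch

def diveq(z, q, sigma=0.0):
    """DiVeQ surrogate; sigma=0 gives "DiVeQ-detach".

    Forward (sigma=0): z_q = q exactly, so training matches inference.
    Backward: the detached unit direction carries no gradient, but the
    differentiable radius ||q - z|| gives gradients to both z and q.
    """
    radius = (q - z).norm(dim=-1, keepdim=True)          # stays in the graph
    direction = q - z + sigma * torch.randn_like(z)
    direction = direction / direction.norm(dim=-1, keepdim=True)
    return z + radius * direction.detach()
```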
The paper's experiments show that although the noisy version introduces randomness and hence some training-inference inconsistency, it performs better than DiVeQ-detach. From my aesthetic standpoint, however, performance should not come at the expense of elegance, so DiVeQ-detach is my ideal scheme, and all references to DiVeQ in the following analysis mean DiVeQ-detach.
Theoretical Analysis#
Regrettably, the original paper contains little theoretical analysis, so in this section I attempt a basic analysis of why DiVeQ works and how it relates to the original VQ training scheme. First, consider the general form of DiVeQ-detach:
$$ z_q = z + r(q, z) \times \sg\left[\frac{q - z}{r(q, z)}\right] $$

where $r(q,z)$ is an arbitrary differentiable scalar function of $q$ and $z$, which can be regarded as some distance between $q$ and $z$. Denote the loss function by $\mathcal{L}(z_q)$; its differential is:
$$ d\mathcal{L} = \langle\nabla_{z_q} \mathcal{L},d z_q\rangle = \left\langle\nabla_{z_q} \mathcal{L},dz + dr \times\frac{q-z}{r}\right\rangle = \langle\nabla_{z_q} \mathcal{L},d z\rangle + \langle\nabla_{z_q} \mathcal{L}, q-z\rangle\, d(\ln r) $$

The term $\langle\nabla_{z_q} \mathcal{L},d z\rangle$ is already present in standard VQ. DiVeQ adds the extra term $\langle\nabla_{z_q} \mathcal{L}, q-z\rangle\, d(\ln r)$; in other words, it is equivalent to introducing an Auxiliary Loss $\sg[\langle\nabla_{z_q} \mathcal{L}, q-z\rangle] \ln r$. If $r$ is a distance between $q$ and $z$, this term reduces that distance, much like the Auxiliary Losses of VQ-VAE. This gives DiVeQ a theoretical explanation.
But don’t celebrate too soon; this explanation holds only if the coefficient $\langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0$; otherwise, it would actually be increasing the distance. To demonstrate this, let’s consider the first-order approximation of the loss function $\mathcal{L}(z)$ at $z_q$:
$$ \mathcal{L}(z) \approx \mathcal{L}(z_q) + \langle\nabla_{z_q} \mathcal{L}, z - z_q\rangle = \mathcal{L}(z_q) + \langle\nabla_{z_q} \mathcal{L}, z - q\rangle $$

That is, $\langle\nabla_{z_q} \mathcal{L}, q-z\rangle\approx \mathcal{L}(z_q) - \mathcal{L}(z)$. Note that $z$ and $z_q$ are the features before and after VQ. VQ loses information, so using $z$ for the target task (e.g., reconstruction) is easier than using $z_q$. Once training begins to converge, we can therefore expect the loss at $z$ to be lower, i.e., $\mathcal{L}(z_q) - \mathcal{L}(z) > 0$. This suggests that $\langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0$ indeed tends to hold.
Directions for Improvement#
Strictly speaking, $\langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0$ can only be considered a necessary condition for DiVeQ’s effectiveness. To fully demonstrate its effectiveness, it would also be necessary to prove that this coefficient is “just right”. Due to the arbitrary nature of $r(q,z)$, we can only analyze specific functions individually. If we consider $r(q,z)=\Vert q-z\Vert^{\alpha}$, then it is equivalent to introducing the following Auxiliary Loss:
$$ \sg[\langle\nabla_{z_q} \mathcal{L}, q-z\rangle] \ln \Vert q-z\Vert^{\alpha}\approx \sg[\mathcal{L}(z_q) - \mathcal{L}(z)]\times \alpha\ln \Vert q-z\Vert $$

The coefficient $\mathcal{L}(z_q) - \mathcal{L}(z)$ has the same scale as the main loss $\mathcal{L}(z_q)$, so it adapts automatically to the main loss's magnitude and adjusts the Auxiliary Loss weight according to the performance gap before and after VQ. As for the optimal $\alpha$, I believe it depends on experiment; in my own tuning, $\alpha=1$ generally performed better. Interested readers can try other values of $\alpha$, or even other $r(q, z)$ functions.
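The $\alpha$-parameterized family is a one-line change to DiVeQ-detach; a sketch under the same assumptions as before ($q \ne z$, PyTorch):

```python
import torch

def diveq_general(z, q, alpha=1.0):
    """DiVeQ-detach with r(q, z) = ||q - z||^alpha; alpha=1 is the default.

    The forward pass always gives z_q = q exactly; alpha only rescales
    the implicit "shrink ||q - z||" gradient signal, so it is a knob to
    tune experimentally as discussed above.
    """
    r = (q - z).norm(dim=-1, keepdim=True).pow(alpha)  # differentiable
    return z + r * ((q - z) / r).detach()
```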
It should be noted that DiVeQ only provides a new, Auxiliary-Loss-free VQ training scheme; in principle it does not solve other VQ problems, such as low codebook utilization or codebook collapse. Enhancement techniques that worked in the "STE + Aux Loss" setting can likewise be layered on top of DiVeQ. The original paper combines DiVeQ with SFVQ to propose SF-DiVeQ, which mitigates problems like codebook collapse.
However, I personally find SFVQ a bit cumbersome, so I won't elaborate on it here; the author's choice to combine with SFVQ is likely because SFVQ was his own previous work in the same lineage. I prefer the linear-transformation trick introduced in "Another VQ Trick: Adding a Linear Transformation to the Codebook", which adds a linear transformation after the codebook. Experiments show that this also significantly improves DiVeQ's performance.
Summary#
This article introduced a new training scheme for VQ (Vector Quantization). It can be implemented solely through STE, without the need for additional Auxiliary Losses, thus appearing exceptionally concise and elegant.
@online{kexuefm-11328,
title={DiVeQ: A Very Concise VQ Training Scheme},
author={苏剑林},
year={2025},
month={10},
url={\url{https://kexue.fm/archives/11328}},
}