Newton-Schulz Iteration for the msign Operator (Part II)

Table of Contents

This is a gemini-2.5-flash translation of a Chinese article.

It has NOT been vetted for errors. You should have the original article open in a parallel tab at all times.

By Su Jianlin | 2025-06-05 | 15,506 readers

In the previous article 《Newton-Schulz Iteration for the msign Operator (Part I)》, we attempted to find better Newton-Schulz iterations for the $\mathop{\text{msign}}$ operator, with the aim of achieving the highest possible approximation degree within a finite number of iterations. This process can be transformed into finding polynomial iterations of the same form for the scalar function $\mathop{\text{sign}}(x)$. At that time, our approach was to find a local optimal solution end-to-end using the Adam optimizer, which was effective but somewhat crude.

A few days ago, a new paper was published on arXiv, 《The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm》. The authors applied a series of ingenious mathematical conclusions to provide a more elegant answer in a rigorous manner. In this article, let us appreciate and learn from this excellent paper.

Problem Description
#

We will not repeat the relevant background and transformation process. Directly stating the problem we need to solve:

$$ \begin{equation}\mathop{\text{argmin}}_f d(f(x),1)\end{equation} $$

where $f = f_T \circ \dots \circ f_2 \circ f_1$, $\circ$ represents function composition, $f_t(x)$ is an odd polynomial with respect to $x$ (containing only odd powers of $x$), and $d(f(x),1)$ is some metric measuring the distance between $f(x)$ and $1$. In the previous article, we uniformly selected a finite number of points in $[0,1]$ and took the average of the largest $k$ values of $|f(x)-1|$ as the metric. This paper, however, directly uses the maximum value of $|f(x)-1|$ within the interval as the metric, i.e.,

$$ \begin{equation}\mathop{\text{argmin}}_f \max_{x\in[l,u]} |f(x) - 1| \end{equation} $$

where $[l,u]\subset [0,1]$. It should be noted that $u$ can be directly set to 1, but $l$ cannot be 0, because $f(0)$ is always 0, which means the above expression will always be greater than or equal to 1 and cannot converge, so $l$ must be chosen as a number very close to 0. According to the analysis in the previous article, for universality, we should account for singular values around $0.001$, thus $l=0.001$ can be considered.

Polar Decomposition For a square matrix $\boldsymbol{M}\in\mathbb{R}^{n\times n}$, its polar decomposition is $\boldsymbol{M}=\boldsymbol{Q}\boldsymbol{S}$, where $\boldsymbol{Q}$ is an orthogonal matrix, and $\boldsymbol{S}$ is a positive semi-definite matrix.

If the SVD of $\boldsymbol{M}$ is $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, then we exactly have

$$ \begin{equation}\boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} = (\boldsymbol{U}\boldsymbol{V}^{\top})(\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top})\end{equation} $$

That is, $\boldsymbol{Q}=\boldsymbol{U}\boldsymbol{V}^{\top},\boldsymbol{S}=\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$ is one solution for the polar decomposition. And we know that when $\boldsymbol{M}$ is a full-rank matrix, $\boldsymbol{U}\boldsymbol{V}^{\top}$ is exactly $\mathop{\text{msign}}(\boldsymbol{M})$. This is why $\mathop{\text{msign}}$ is connected to “Polar,” because once it is found, the matrix’s “Polar Decomposition” can be obtained. In other words, the essential difficulty of polar decomposition is calculating $\mathop{\text{msign}}$, which is similar to Muon.

Greedy is Enough
#

Returning to the main topic. For problem (eq:opt), the first conclusion drawn from the original paper, and arguably the most core conclusion of the entire paper, is: Its greedy solution is exactly its global optimal solution! In formulaic terms, this means that the solution to problem (eq:opt) can be transformed into:

$$ \begin{equation}\begin{gathered} f^* = f_T^* \circ \dots \circ f_2^* \circ f_1^* \\[12pt] f_1^* = \mathop{\text{argmin}}_{f_1} \max_{x\in[l_1,u_1]} |f_1(x) - 1| \\ f_2^* = \mathop{\text{argmin}}_{f_2} \max_{x\in[l_2,u_2]} |f_2(x) - 1| \\ \vdots \\ f_T^* = \mathop{\text{argmin}}_{f_T} \max_{x\in[l_T,u_T]} |f_T(x) - 1| \\[24pt] l_1 = l,\quad u_1 = u, \\[8pt] l_{t+1} = \min_{x\in[l_t,u_t]} f_t^*(x),\quad u_{t+1} = \max_{x\in[l_t,u_t]} f_t^*(x) \end{gathered}\end{equation} $$

I believe this conclusion will surprise many readers; I was quite astonished and applauded it when I first saw it. It not only greatly reduces the difficulty of solving, transforming the original $T$-step compound function solving into a step-by-step solving of a single polynomial for $T=1$, but it also allows us to push the solution forward step by step, maintaining optimality (i.e., the optimal solution for $T+1$ steps only requires one more step based on the $T$-step optimal solution, without needing to recalculate from scratch).

It is worth noting that this conclusion allows each $f_t$ to have different degrees (here, “degree” refers to the highest power of the polynomial). For example, $f_1$ can be degree 3, $f_2$ can be degree 5, and so on, but the conclusion “greedy solution is exactly global optimal solution” remains unchanged. For simplicity, however, we will keep all $f_t$ at the same degree below, and mainly consider results for degree 3 and 5.

The complete proof of the above conclusion is slightly complex, so we will put it at the end and first complete the subsequent operations based on this conclusion.

Equioscillation
#

Since we have transformed the original problem into finding a greedy solution, we now only need to focus on solving

$$ \begin{equation}\mathop{\text{argmin}}_{f_t} \max_{x\in[l_t,u_t]} |f_t(x) - 1| \end{equation} $$

To solve the above equation, we first need to understand the “Equioscillation Theorem” for odd polynomials, as introduced in 《Equioscillation Theorem: Necessary and Sufficient Conditions for Optimal Polynomial Approximation》:

Equioscillation Theorem - Odd Let $f(x)$ be an odd polynomial of degree at most $2n+1$, and $g(x)$ be a continuous function on the interval $[a,b]\subset (0,\infty)$. Then

$$ \begin{equation}f^* = \mathop{\text{argmin}}_f \max_{x\in[a,b]} |f(x) - g(x)|\end{equation} $$

The necessary and sufficient condition is that there exist $a\leq x_0 < x_1 < \cdots < x_{n+1} \leq b$ and $\sigma\in\{0,1\}$ such that

$$ \begin{equation}f^*(x_k) - g(x_k) = (-1)^{k+\sigma} \max_{x\in[a,b]} |f^*(x) - g(x)|\end{equation} $$

Now we need to find $f_t$, and the target $g$ is constant $1$. The Equioscillation Theorem tells us that $|f_t^*(x)-1|$ reaches the maximum error (denoted as $\mathcal{E}$) at least $n+2$ times in $[l_t,u_t]$. It is easy to see that the maximum points of $|f_t^*(x)-1|$ can only be boundary points or critical points of $f_t^*(x)$. An odd polynomial of degree $2n+1$ has at most $n$ critical points in $(0,\infty)$. Therefore, to “gather” enough $n+2$ points, we “must” include the boundary points, which determines $x_0 = l_t, x_{n+1}=u_t$, and $x_1,\cdots,x_n$ are the zeros of $\frac{d}{dx}f_t^*(x)$.

Furthermore, since the target function is $1$, the slope of $f_t^*(x)$ at $x=0$ is greater than zero, so $l_t$ must be a minimum point of $f_t^*(x)$, and therefore $\sigma=1$. Combining these results, we actually need to solve the system of equations:

$$ \begin{equation}f_t(l_t) = 1 - \mathcal{E}, \quad f_t(u_t) = 1 + (-1)^n \mathcal{E},\quad f_t(x_i) = 1 + (-1)^{i+1}\mathcal{E}, \quad f_t'(x_i) = 0\end{equation} $$

where $i=1,2,3,\cdots,n$. It can be seen that both equations and unknowns are $2n+2$. Plus the constraints $l_t < x_1 < \cdots < x_n < u_t$ and $\mathcal{E} > 0$, the solution can theoretically be determined.

Solving the System of Equations
#

For a degree 3 odd polynomial ($n=1$), the original paper provides an analytical solution. For a degree 5 odd polynomial ($n=2$), the original paper provides an iterative solution algorithm: first fix $x_1,x_2$ to solve for $a,b,c$, then fix $a,b,c$ of $f_t(x)$ to solve for $x_1,x_2$, iterating repeatedly. This is essentially a simplified version of the Remez algorithm.

However, the original paper’s iteration relies on root-finding formulas to solve for $x_1,x_2$, which becomes less manageable for larger $n$. So here, I’ll change the solving approach: first, parameterize $f_t'(x)$ with $x_1,x_2,\cdots,x_n$ by defining

$$ \begin{equation}f_t'(x) = k(x^2-x_1^2)(x^2-x_2^2)\cdots (x^2-x_n^2)\end{equation} $$

Then $f_t(x) = \int_0^x f_t'(x) dx$. This way, we express $f_t(x)$ using $k$ and $x_1,x_2,\cdots,x_n$, and then only need to solve the system of equations:

$$ \begin{equation}f_t(l_t) = 1 - \mathcal{E}, \quad f_t(u_t) = 1 + (-1)^n \mathcal{E},\quad f_t(x_i) = 1 + (-1)^{i+1}\mathcal{E}\end{equation} $$

avoiding solving the equation $f_t'(x) = 0$. When $n=1$, we can solve for

$$ \begin{equation}x_1 = \sqrt{\frac{l_t^2+l_t u_t + u_t^2}{3}},\quad k = -\frac{6}{l_t^2 u_t + l_t u_t^2 + 2x_1^3}\end{equation} $$

When $n > 1$, we can directly use Mathematica. For example, when $n=2$:

df[x_] = k*(x^2 - x1^2) (x^2 - x2^2);
f[x_] = Integrate[df[x], {x, 0, x}];
sol = NSolve[{f[l] == 1 - e, f[x1] == 1 + e, f[x2] == 1 - e,
    f[u] == 1 + e, l < x1 < x2 < u, e > 0} /. {l -> 0.001,
    u -> 1}, {k, x1, x2, e}, Reals]
f[x] /. sol

Finite Precision
#

Up to this point, it seems we have completed solving the original problem? Theoretically, yes, but only for infinite precision. Actual computations use finite precision, especially since the Muon optimizer uses bfloat16, where precision loss is more severe, which leads to some issues.

The first problem is that each $f_t^*$ is theoretically only responsible for the interval $[l_t,u_t]$, but with finite precision, singular values might deviate from this interval. When $n$ is an even number (i.e., $f_t^*$ is an odd polynomial of degree 5, 9, …), there is a risk of divergence if values exceed $u_t$, because $f_t^*(x)$ is monotonically increasing to positive infinity when $x > u_t$, and slight errors can lead to divergence with iteration. There are two solutions: first, make the interval $[l_t,u_t]$ slightly wider when calculating $f_t^*$; second, keep the interval unchanged, but divide the input by a number greater than 1 after obtaining $f_t^*$.

The original paper uses the latter, changing $f_t^*(x)$ to $f_t^*(x / 1.01)$. The number 1.01 is approximately the first number after 1 in bfloat16 precision (the exact value is 1.00781), which is clearly to prevent numerical errors from expanding singular values from 1 to the next representable number. If calculations are performed with higher precision, this value can be appropriately reduced.

The second problem is more subtle, and we will illustrate it with a specific example. Let $n=2,l_1=0.001,u_1=1$. We can find $f_1^*$ to be

$$ \begin{equation}f_1^*(x) = 8.4703 x - 25.1081 x^3 + 18.6293 x^5\end{equation} $$

where $x_1 = 0.3674, x_2 = 0.8208, \mathcal{E}=0.9915$. What is wrong with this solution? According to the Equioscillation Theorem, we know that $f_1^*(x_2) = 1-\mathcal{E} = 0.0085$, meaning it maps $0.8208$ to $0.0085$. However, our ultimate goal is to transform all numbers in $(0,1]$ into 1. So $f_1^*$ maps $0.8208$, which is already very close to the target, to $0.0085$, which is very far from the target. Although $f_2^*,f_3^*,\cdots$ would theoretically pull it back gradually, repeatedly scaling a number down and then up in finite precision results in significant accumulated error.

Of course, as the Equioscillation Theorem shows, this oscillatory behavior is unavoidable. We can only hope that the maximum error $\mathcal{E}$ is not too close to 1, thereby mitigating this accumulated error. It is easy to see that the larger the interval $[l_t,u_t]$, the theoretically harder it is to fit, and the closer the maximum error $\mathcal{E}$ will be to 1. Therefore, the paper introduces a hyperparameter $\lambda \in (0, 1)$, changing the optimization interval from $[l_t,u_t]$ to $[\max(l_t, \lambda u_t),u_t]$, limiting the interval size to ensure that $\mathcal{E}$ does not become too large. (Readers should be reminded that the paper uses $\lambda=0.1$ in the main text, but the appendix code actually uses $\lambda=0.024$.)

But then, wouldn’t the original $l_t$, especially the $l$ we initially set, be easily ignored? To address this, the paper introduces a “Recenter” technique: that is, if the optimization interval is $[l_t,u_t]$, then $f_t^*(l_t) + f_t^*(u_t) = 2$ will be satisfied. However, it is not necessarily satisfied when the optimization interval is changed to $[\max(l_t, \lambda u_t),u_t]$. At this point, we multiply $f_t^*$ by $\gamma$ to make it satisfy this equality:

$$ \begin{equation}\gamma f_t^*(l_t) + \gamma f_t^*(u_t) = 2\qquad \Rightarrow \qquad \gamma = \frac{2}{f_t^*(l_t) + f_t^*(u_t)}\end{equation} $$

This incorporates the original $l_t$.

Reference Code
#

This is the complete Mathematica code for $n=2$:

df[x_] = k*(x^2 - x1^2) (x^2 - x2^2);
f[x_] = Integrate[df[x], {x, 0, x}];
sol[l_, u_] :=
 NSolve[{f[l] == 1 - e, f[x1] == 1 + e, f[x2] == 1 - e, f[u] == 1 + e,
    l < x1 < x2 < u, e > 0, k > 0}, {k, x1, x2, e}]
ff[x_, l_, u_] = f[x]*2/(f[l] + f[u]) // Expand;
lt = 0.001; ut = 1; lambda = 0.02407327424182761;
While[1 - lt > 0.0001,
 fff[x_] = ff[x, lt, ut] /. sol[Max[lt, lambda*ut], ut][[1]];
 Print[fff[x]];
 lt = fff[lt]; ut = 2 - lt]

The results are as follows ($f_t(x) = a_t x + b_t x^3 + c_t x^5$):

$$ \begin{array}{c|ccc} \hline t & a\times 1.01 & b\times 1.01^3 & c\times 1.01^5 \\ \hline \quad 1\quad & 8.28721 & -23.5959 & 17.3004 \\ 2 & 4.10706 & -2.94785 & 0.544843 \\ 3 & 3.94869 & -2.9089 & 0.551819 \\ 4 & 3.31842 & -2.48849 & 0.510049 \\ 5 & 2.30065 & -1.6689 & 0.418807 \\ 6 & 1.8913 & -1.268 & 0.376804 \\ 7 & 1.875 & -1.25 & 0.375 \\ 8 & 1.875 & -1.25 & 0.375 \\ \hline \end{array} $$

Note that the results given here are before applying the $f_t^*(x / 1.01)$ transformation, so the actual $a, b, c$ values need to be further divided by $1.01$ to the powers of $1, 3, 5$ respectively, based on this table. The results after dividing by $1.01$ are not directly given because the convergence values before division, $1.875, -1.25, 0.375$ (for $t \geq 7$), appear simpler and clearer, making them easier to observe and appreciate. (Food for thought: Please prove that the final converged values can be derived from $x_1=x_2=1$ and $f(1)=1$.)

The author’s appendix code is organized as follows:

import numpy as np

def optimal_quintic(l, u):
    assert 0 <= l <= u
    if 1 - 5e-6 <= l / u:
        # Above this threshold, the equoscillating polynomials
        # is numerically equal to...
        return (15 / 8) / u, (-10 / 8) / (u**3), (3 / 8) / (u**5)
    # This initialization becomes exact as l -> u
    q = (3 * l + 1) / 4
    r = (l + 3) / 4
    E, old_E = np.inf, None
    while not old_E or abs(old_E - E) > 1e-15:
        old_E = E
        LHS = np.array([
            [l, l**3, l**5, 1],
            [q, q**3, q**5, -1],
            [r, r**3, r**5, 1],
            [u, u**3, u**5, -1],
        ])
        a, b, c, E = np.linalg.solve(LHS, np.ones(4))
        q, r = np.sqrt(
            (-3 * b + np.array([-1, 1]) * np.sqrt(9 * b**2 - 20 * a * c)) /
            (10 * c)
        )
    return float(a), float(b), float(c)

def optimal_composition(l, num_iters, cushion=0.02407327424182761):
    u = 1
    coefficients = []
    for _ in range(num_iters):
        a, b, c = optimal_quintic(max(l, cushion * u), u)
        # Due to cushioning , this may be centered around 1 with
        # respect to 0.024*u, u. Recenter it around 1 with respect
        # to l, u, meaning find c so that 1 - c*p(l) = c*p(u) - 1:
        pl = a * l + b * l**3 + c * l**5
        pu = a * u + b * u**3 + c * u**5
        rescalar = 2 / (pl + pu)
        a *= rescalar
        b *= rescalar
        c *= rescalar
        # Optionally incorporate safety factor here :
        # a /= 1.01; b /= 1.01**3; c /= 1.01**5
        coefficients.append((a, b, c))
        l = a * l + b * l**3 + c * l**5
        u = 2 - l
    return coefficients

print(*optimal_composition(1e-3, 10), sep="\n")

Complete Proof
#

In the last section, we will complete the proof that “the greedy solution is exactly its global optimal solution.”

According to the Equioscillation Theorem, we know that the range of $f_t^*$ is $[l_{t+1},u_{t+1}]$, where $l_{t+1}=f_t^*(l_t)$ and $u_{t+1}=2-l_{t+1}$. From this, it can be seen that the maximum error of the $T$-step greedy solution is $\mathcal{E}_T = 1 - l_{T+1} = 1 - f_T^*(l_T)$. We only need to prove that the maximum error of the $T$-step global optimal solution can also only be reduced to $1 - f_T^*(l_T)$ to obtain the conclusion that “the greedy solution is exactly its global optimal solution.”

The proof strategy is mathematical induction. Assume the conclusion holds for $t=1,2,\cdots,T-1$. Then $\hat{f} = f_{T-1}^*\circ \cdots \circ f_2^* \circ f_1^*$ is the global optimal solution for $T-1$ steps, with range $[l_T, u_T]$ and maximum error $\mathcal{E}_{T-1}=1-l_T=u_T-1$. On the other hand, let $\tilde{f} = \tilde{f}_{T-1}\circ \cdots \circ \tilde{f}_2 \circ \tilde{f}_1$ be any $T-1$ step solution, with range $[a,b]$. Let $c = \frac{2}{a+b}$. Then the range of $c\tilde{f}$ is $[ca,cb]$, where clearly $ca\leq 1, cb\geq 1$. According to the inductive hypothesis, we have

$$ \begin{equation}\begin{aligned} 1 - ca \geq \mathcal{E}_{T-1} \\ cb - 1 \geq \mathcal{E}_{T-1} \end{aligned}\qquad\Rightarrow\qquad \frac{a}{b} \leq \frac{1 - \mathcal{E}_{T-1}}{1 + \mathcal{E}_{T-1}} = \frac{l_T}{u_T} \end{equation} $$

That is, the relative size of the range of any $T-1$ step solution is not less than the relative size of the range $[l_T, u_T]$ of the $T-1$ step optimal solution. Next, we have

$$ \begin{equation}\begin{aligned} \min_{f_T} \max_{x\in[l,u]} |f_T(\tilde{f}(x)) - 1| =&\, \min_{f_T} \max_{x\in[a,b]} |f_T(x) - 1| \\ =&\, \min_{f_T} \max_{x\in[a/b,1]} |f_T(x) - 1| \\ \geq &\, \min_{f_T} \max_{x\in[l_T/u_T,1]} |f_T(x) - 1| \\ =&\, \min_{f_T} \max_{x\in[l_T,u_T]} |f_T(x) - 1| \\ =&\, \mathcal{E}_T \end{aligned}\end{equation} $$

In other words, if you take any other $T-1$ step solution, its maximum error can at most be as small as that of the greedy solution, so the maximum error of the greedy solution is already globally optimal. This completes the recursive proof. A key step in the above equation is

$$ \begin{equation}\min_{f_T} \max_{x\in[a,b]} |f_T(x) - 1| = \min_{f_T} \max_{x\in[a/b,1]} |f_T(x) - 1|\end{equation} $$

This is because we can always set $g_T(y) = f_T(b y)$, and $g_T$ can still represent any odd polynomial of the same degree. So $g_T$ and $f_T$ are in the same function space, and thus the notation can be interchanged, i.e.,

$$ \begin{equation}\min_{f_T}\max_{x\in[a,b]} |f_T(x) - 1| = \min_{g_T}\max_{y\in[a/b,1]} |g_T(y) - 1|= \min_{f_T}\max_{x\in[a/b,1]} |f_T(x) - 1|\end{equation} $$

Summary (formatted)
#

This article introduces the latest progress in finding better Newton-Schulz iterations for the msign operator. It directly finds the theoretically optimal solution through the Equioscillation Theorem and greedy transformation, a highly rigorous process worth learning.

@online{kexuefm-10996,
        title={Newton-Schulz Iteration for the msign Operator (Part II)},
        author={苏剑林},
        year={2025},
        month={06},
        url={\url{https://kexue.fm/archives/10996}},
}

Problem Description#

Greedy is Enough#

Equioscillation#

Solving the System of Equations#

Finite Precision#

Reference Code#

Complete Proof#

Summary (formatted)#