Translations from Jianlin Su
These are just select articles I wanted to read. Do not consider them vetted for correctness or otherwise accurate.
In fact, there are LaTeX rendering bugs I have not addressed in the articles I have not read. I have made no effort to make, e.g., \label -> \ref matching or \newcommand instances work in KaTeX.
2025
4. Muon + Spectral Sphere
·1319 words
3. Muon + Stiefel
·2666 words
2. Muon + Orthogonal
·1634 words
1. SGD + Hypersphere
·1509 words
Advancing Muon Further on its Scale-up Journey
·2970 words
Newton-Schulz Iteration for the msign Operator (Part II)
·2925 words
Newton-Schulz Iteration for the msign Operator (Part I)
·2672 words
Simpler Yet More Profound Spectral Condition Scaling
·3053 words
Hyperparameter Scaling Laws Across Model Scales
·3590 words
Why Did We Choose to Try Muon?
·2404 words
Why is the default norm length for gradient clipping 1?
·1033 words
2024
Reflections on Novel Weight Decay from Spectral Norm Gradient
·1600 words
A Fundamental Leap from Vectors to Matrices
·3762 words
Understanding Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation
·1281 words
How Does Adam's Epsilon Affect the Learning Rate Scaling Law?
·1761 words
How Should the Learning Rate Change When Batch Size Increases?
·4065 words
SVD
·3195 words
2023
Exploring the Path to Minimums
·1828 words
2022
What's so Difficult About Training a 1000-Layer Transformer?
·2202 words
2021
Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods
·903 words
A Brief Discussion on Initialization, Parameterization, and Normalization of Transformers
·3191 words
2020
Understanding Model Parameter Initialization Strategies from a Geometric Perspective
·1464 words
2019
Distribution of the Angle Between Two Random Vectors in $n$-Dimensional Space
·750 words
What Does BN Actually Do? An Analysis From First Principles
·2529 words
2018
Generalization and Generative Models
·3667 words