
Translations from Jianlin Su

These are simply selected articles I wanted to read. Do not consider them vetted for correctness or otherwise accurate.

In fact, there are LaTeX rendering bugs I have not addressed in the articles I have not yet read. I have made no effort to make features such as \label/\ref matching or \newcommand definitions work in KaTeX.

2025

Muon Optimizer Guide - Quick Start and Key Details
·1826 words
Asymptotic Estimation of AdamW's Weight RMS (Part 2)
·1987 words
Asymptotic Estimate for the Maximum of n Normal Random Variables
·1980 words
5. Dual Gradient Descent
·1524 words
Low-Precision Attention May Suffer from Biased Rounding Errors
·2347 words
1. Three Characteristics of a Good Model
·1583 words
Fast Estimation of the Spectral Norm of Random Matrices
·762 words
A Very Concise VQ Training Scheme
·1940 words
Why Add Short Conv to Linear Attention?
·1239 words
Asymptotic Estimation of AdamW's Weight RMS
·2376 words
Rethinking Learning Rate and Batch Size (IV) - EMA
·1813 words
Rethinking Learning Rate and Batch Size (III) - Muon
·1064 words
Rethinking the Relationship between Learning Rate and Batch Size (II) - Mean Field
·2436 words
Why is Adam's Update RMS 0.2?
·1511 words
Rethinking the Relationship Between Learning Rate and Batch Size (Part 1) - Current Status
·1649 words
4. Muon + Spectral Sphere
·1359 words
3. Muon + Stiefel
·2795 words
2. Muon + Orthogonal
·1645 words
1. SGD + Hypersphere
·1520 words
Advancing Muon Further on its Scale-up Journey
·2981 words
Newton-Schulz Iteration for the msign Operator (Part 2)
·3014 words
Newton-Schulz Iteration for the msign Operator (Part 1)
·2775 words
Higher-Order muP - Simpler Yet More Profound Spectral Condition Scaling
·3066 words
Hyperparameter Scaling Laws Across Model Scales
·3601 words
Why Did We Choose to Try Muon?
·2415 words
Why is the default norm length for gradient clipping 1?
·1044 words

2024

Reflections on Novel Weight Decay from Spectral Norm Gradient
·1611 words
A Fundamental Leap from Vectors to Matrices
·3805 words
Understanding Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation
·1292 words
How Does Adam's Epsilon Affect the Learning Rate Scaling Law?
·1772 words
How Should the Learning Rate Change When Batch Size Increases?
·4076 words
Adding a Linear Transformation to the Codebook
·1575 words
SVD
·3206 words

2023

VQing the Key Makes Transformer Complexity Linear
·2777 words
'Rounding' Surpasses VQ-VAE
·1848 words
Exploring the Path to Minimums
·1839 words

2022

What's so Difficult About Training a 1000-Layer Transformer?
·2213 words

2021

Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods
·914 words
A Brief Discussion on Initialization, Parameterization, and Normalization of Transformers
·3212 words

2020

Understanding Model Parameter Initialization Strategies from a Geometric Perspective
·1486 words

2019

Distribution of the Angle Between Two Random Vectors in $n$-Dimensional Space
·761 words
What Does BN Actually Do? An Analysis From First Principles
·2540 words
Vector Quantized AutoEncoder
·2645 words

2018

Generalization and Generative Models
·3780 words