
Translations from Jianlin Su

These are just select articles I wanted to read. Do not consider them vetted for correctness or otherwise accurate.

In fact, the articles I have not read may still contain LaTeX rendering bugs I have not addressed. I have made no effort to make, for example, \label/\ref matching or \newcommand definitions work in KaTeX.

2025

A Very Concise VQ Training Scheme
·1929 words
Why Add Short Conv to Linear Attention?
·1228 words
Asymptotic Estimation of AdamW's Weight RMS
·2347 words
Rethinking the Relationship Between Learning Rate and Batch Size (IV) - EMA
·1802 words
Rethinking the Relationship Between Learning Rate and Batch Size (III) - Muon
·1053 words
Rethinking the Relationship Between Learning Rate and Batch Size (II) - Mean Field
·2425 words
Why Is Adam's Update RMS 0.2?
·1483 words
Rethinking the Relationship Between Learning Rate and Batch Size (I) - Current Status
·1638 words
4. Muon + Spectral Sphere
·1319 words
3. Muon + Stiefel
·2666 words
2. Muon + Orthogonal
·1634 words
1. SGD + Hypersphere
·1509 words
Advancing Muon Further on Its Scale-up Journey
·2970 words
Newton-Schulz Iteration for the msign Operator (Part II)
·2925 words
Newton-Schulz Iteration for the msign Operator (Part I)
·2672 words
Simpler Yet More Profound Spectral Condition Scaling
·3053 words
Hyperparameter Scaling Laws Across Model Scales
·3590 words
Why Did We Choose to Try Muon?
·2404 words
Why Is the Default Norm Length for Gradient Clipping 1?
·1033 words

2024

Reflections on Novel Weight Decay from Spectral Norm Gradient
·1600 words
A Fundamental Leap from Vectors to Matrices
·3762 words
Understanding Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation
·1281 words
How Does Adam's Epsilon Affect the Learning Rate Scaling Law?
·1761 words
How Should the Learning Rate Change When Batch Size Increases?
·4065 words
Adding a Linear Transformation to the Codebook
·1564 words
SVD
·3195 words

2023

VQing the Key Makes Transformer Complexity Linear
·2766 words
'Rounding' Surpasses VQ-VAE
·1837 words
Exploring the Path to Minimums
·1828 words

2022

What's So Difficult About Training a 1000-Layer Transformer?
·2202 words

2021

Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods
·903 words
A Brief Discussion on Initialization, Parameterization, and Normalization of Transformers
·3191 words

2020

Understanding Model Parameter Initialization Strategies from a Geometric Perspective
·1464 words

2019

Distribution of the Angle Between Two Random Vectors in $n$-Dimensional Space
·750 words
What Does BN Actually Do? An Analysis From First Principles
·2529 words
Vector Quantized AutoEncoder
·2634 words

2018

Generalization and Generative Models
·3667 words