
Translations from Jianlin Su

These are simply selected articles I wanted to read. Do not consider them vetted for correctness or otherwise accurate.

In fact, there are LaTeX rendering bugs I have not addressed in the articles I have not yet read. I have made no effort to make features such as \label/\ref matching or \newcommand definitions work in KaTeX.

2025

Muon Optimizer Guide - Quick Start and Key Details
·1826 words
Asymptotic Estimation of AdamW's Weight RMS (Part 2)
·1987 words
Asymptotic Estimate for the Maximum of n Normal Random Variables
·1980 words
5. Dual Gradient Descent
·1524 words
Low-Precision Attention May Suffer from Biased Rounding Errors
·2347 words
1. Three Characteristics of a Good Model
·1583 words
Fast Estimation of the Spectral Norm of Random Matrices
·762 words
A Very Concise VQ Training Scheme
·1940 words
Why Add Short Conv to Linear Attention?
·1239 words
Asymptotic Estimation of AdamW's Weight RMS
·2376 words
Rethinking Learning Rate and Batch Size (IV) - EMA
·1813 words
Rethinking Learning Rate and Batch Size (III) - Muon
·1064 words
Rethinking the Relationship between Learning Rate and Batch Size (II) - Mean Field
·2436 words
Why is Adam's Update RMS 0.2?
·1511 words
Rethinking the Relationship Between Learning Rate and Batch Size (Part 1) - Current Status
·1649 words
4. Muon + Spectral Sphere
·1359 words
3. Muon + Stiefel
·2795 words
2. Muon + Orthogonal
·1645 words
1. SGD + Hypersphere
·1520 words
Advancing Muon Further on its Scale-up Journey
·2981 words
Newton-Schulz Iteration for the msign Operator (Part 2)
·3014 words
Newton-Schulz Iteration for the msign Operator (Part 1)
·2775 words
Higher-Order muP - Simpler Yet More Profound Spectral Condition Scaling
·3066 words
Hyperparameter Scaling Laws Across Model Scales
·3601 words
Why Did We Choose to Try Muon?
·2415 words
Why is the default norm length for gradient clipping 1?
·1044 words

2024

Reflections on Novel Weight Decay from Spectral Norm Gradient
·1611 words
A Fundamental Leap from Vectors to Matrices
·3805 words
Understanding Adaptive Learning Rate Optimizers from the Perspective of Hessian Approximation
·1292 words
How Does Adam's Epsilon Affect the Learning Rate Scaling Law?
·1772 words
How Should the Learning Rate Change When Batch Size Increases?
·4076 words
Adding a Linear Transformation to the Codebook
·1575 words
SVD
·3206 words

2023

VQing the Key Makes Transformer Complexity Linear
·2777 words
'Rounding' Surpasses VQ-VAE
·1848 words
Exploring the Path to Minimums
·1839 words

2022

What's so Difficult About Training a 1000-Layer Transformer?
·2213 words

2021

Thoughts on Dimension Averaging Strategies for Non-Square Matrices in Initialization Methods
·914 words
A Brief Discussion on Initialization, Parameterization, and Normalization of Transformers
·3212 words

2020

Understanding Model Parameter Initialization Strategies from a Geometric Perspective
·1486 words

2019

Distribution of the Angle Between Two Random Vectors in $n$-Dimensional Space
·761 words
What Does BN Actually Do? An Analysis From First Principles
·2540 words
Vector Quantized AutoEncoder
·2645 words

2018

Generalization and Generative Models
·3780 words