Road to a Better Transformer

These are gemini-2.5-flash-preview-04-17 translations of Jianlin Su’s Transformer升级之路 articles, vetted by native Chinese readers.

I picked 2.5 Flash for 3 reasons:

  1. it is fast & free (given only 20 articles to translate)
  2. no chunking required, output length is sufficient (and consistent)
  3. gemini models are generally better than claude/openai models at 'rote' wholesale semantic copying tasks such as translation. In my experience, it is very difficult to coax other models into avoiding summarization, omission, or reshaping of the input.

I’ve put more care into the technical accuracy of these translations than in the previous set. Nonetheless, you should be mindful of potential errors.

2025

19. The Second Type of Rotary Positional Embedding
·1241 words

2024

18. Principle for RoPE's Base Selection
·2411 words
17. Simple Thoughts on Multimodal Positional Encoding
·2732 words
16. A 'Review' of Length Extrapolation Techniques
·6415 words

2023

15. Key Normalization Boosts Length Extrapolation
·1983 words
14. When HWFA Meets ReRoPE
·1432 words
13. Inverse Use of Leaky ReRoPE
·1417 words
12. Infinitely Extrapolatable ReRoPE?
·2347 words
11. Carrying the Base-β Position to the End
·1464 words
10. RoPE is a β-Based Encoding
·2265 words
9. A New Approach to Global Length Extrapolation
·2070 words
8. Length Extrapolation and Positional Robustness
·2854 words
7. Length Extrapolation and Local Attention
·2094 words

2022

6. Completeness Analysis of Rotational Positional Encoding
·1112 words

2021

5. Linear Attention as Infinite Dimension
·1066 words
4. Rotary Position Embedding for 2D Positions
·2811 words
3. From Performer to Linear Attention
·1661 words
2. Rotary Position Embedding with Diverse Strengths
·2432 words
1. Tracing the Origins of Sinusoidal Position Encoding
·2220 words