Road to a Better Transformer
These are gemini-2.5-flash-preview-04-17 translations of Jianlin Su’s Transformer升级之路 articles, vetted by native Chinese readers.
I picked 2.5 Flash for 3 reasons:
- it is fast and free (for a batch of only 20 articles)
- no chunking is required; its output length limit is sufficient (and its outputs are consistent)
- Gemini models are, in my experience, better than Claude/OpenAI models at 'rote' wholesale semantic-copying tasks such as translation. I personally find it very difficult to coax other models into not summarizing, omitting, or reshaping the input.
I’ve put more care into the technical accuracy of these translations than I did for the previous set. Nonetheless, be mindful of potential errors.
2025
19. The Second Type of Rotary Positional Embedding · 1252 words

2024
18. Principle for RoPE's Base Selection · 2452 words
17. Simple Thoughts on Multimodal Positional Encoding · 2743 words
16. A 'Review' of Length Extrapolation Techniques · 6426 words

2023
15. Key Normalization Boosts Length Extrapolation · 1994 words
14. When HWFA Meets ReRoPE · 1443 words
13. Inverse Use of Leaky ReRoPE · 1428 words
12. Infinitely Extrapolatable ReRoPE? · 2358 words
11. Carrying the Base-β Position to the End · 1475 words
10. RoPE is a β-Based Encoding · 2276 words
9. A New Approach to Global Length Extrapolation · 2081 words
8. Length Extrapolation and Positional Robustness · 2882 words
7. Length Extrapolation and Local Attention · 2105 words

2022
6. Completeness Analysis of Rotational Positional Encoding · 1123 words

2021
5. Linear Attention as Infinite Dimension · 1077 words
4. Rotary Position Embedding for 2D Positions · 2850 words
3. From Performer to Linear Attention · 1672 words
2. Rotary Position Embedding with Diverse Strengths · 2452 words
1. Tracing the Origins of Sinusoidal Position Encoding · 2236 words