
Things I learned digging into 5090 perf


This weekend, prompted by a tweet, I tried to evaluate 5090 matmul performance.

Raw mma perf

mmapeak is a simple tensor core benchmarking utility.

# ./mmapeak
----------------------------------------
Device 0: NVIDIA GeForce RTX 5090
  Compute capability: 12.0
  Total global memory: 31.4 GiB
  Multiprocessor count: 170
Running benchmarks with target time: 3.0 seconds

It does N_LOOP_INTERNAL=8192 iterations of mma.sync.aligned.* instructions on zeros held in registers. See here for an mxf8 @ mxf8 -> f32 example.
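For orientation, a figure like "737.1 T(fl)ops" is just FLOP counting: each mma.sync of shape MxNxK performs 2*M*N*K FLOPs, multiplied by however many mmas were issued and divided by wall time. A rough sketch (the launch-configuration numbers below are made up for illustration, not mmapeak's actual values):

```python
# Back-of-envelope FLOP accounting for a tensor-core mma microbenchmark.
# One mma.sync of shape MxNxK performs 2*M*N*K FLOPs (multiply + add).

def mma_tflops(m, n, k, mmas_per_warp, total_warps, seconds):
    """TFLOPS given the mma shape, mmas issued per warp, warp count, and wall time."""
    flops = 2 * m * n * k * mmas_per_warp * total_warps
    return flops / seconds / 1e12

# Illustrative numbers only (not mmapeak's exact launch configuration):
# an m16n8k16 f16 mma, N_LOOP_INTERNAL=8192 mmas per warp per launch,
# repeated over many launches until ~3 s of measured time accumulates.
mmas_per_warp = 8192 * 1000      # N_LOOP_INTERNAL * number of launches (made up)
total_warps   = 170 * 4 * 8      # SMs * warps/block * blocks/SM (made up)
print(f"{mma_tflops(16, 8, 16, mmas_per_warp, total_warps, seconds=3.0):.1f} TFLOPS")
```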

On a 5090, I obtain:

| A Type | B Type | Shape    | 32-bit acc | TFLOPS | Spec (TFLOPS) | MMAPeak / Spec |
|--------|--------|----------|------------|--------|---------------|----------------|
| int4   | int4   | 8x8x32   | int32      | 79.3   | 1676 (?)      | 0.05 |
| mxf4   | mxf4   | 16x8x64  |            | 1474.2 | 1676          | 0.88 |
| nvf4   | nvf4   | 16x8x64  |            | 1474.2 | 1676          | 0.88 |
| f4     | f4     | 16x8x32  |            | 369.5  | 1676 (?)      | 0.22 |
| f4     | f4     | 16x8x32  |            | 369.6  | 1676          | 0.22 |
| f6     | f6     | 16x8x32  |            | 369.4  | 838 (?)       | 0.44 |
| f6     | f6     | 16x8x32  |            | 369.6  | 419 (?)       | 0.88 |
| mxf6   | mxf6   | 16x8x32  |            | 369.7  | 419 (?)       | 0.88 |
| mxf8   | mxf8   | 16x8x32  |            | 737.1  | 419           | 1.76 |
| f8     | f8     | 16x8x32  |            | 737.1  | 838           | 0.88 |
| f8     | f8     | 16x8x32  |            | 370.0  | 419           | 0.88 |
| int8   | int8   | 16x16x16 | int32      | 745.7  | 838           | 0.89 |
| int8   | int8   | 32x8x16  | int32      | 745.8  | 838           | 0.89 |
| f16    | f16    | 16x16x16 |            | 745.8  | 419           | 1.78 |
| f16    | f16    | 32x8x16  |            | 745.9  | 419           | 1.78 |
| f16    | f16    | 16x16x16 |            | 370.6  | 209.5         | 1.77 |
| f16    | f16    | 32x8x16  |            | 370.6  | 209.5         | 1.77 |
| bf16   | bf16   | 16x16x16 |            | 370.6  | 209.5         | 1.77 |
| bf16   | bf16   | 32x8x16  |            | 370.6  | 209.5         | 1.77 |
| tf32   | tf32   | 16x8x8   |            | 92.8   | 104.8         | 0.89 |

A few things stand out:

  1. int4 gemm is gone on blackwell. RIP
  2. unscaled f4/f6 performance is trash, idk why.
  3. f16/bf16/mxf8 performance is ridiculously high. Enough to drastically shift GPU prices worldwide if true.

But if the 5090 were this powerful out of the box, it would have been discovered already. So there must be some catch.

torch.matmul

Since no one has sung prayers for the greatness of the 5090 yet, it is likely that a simple matmul benchmark will show substantially worse numbers.

We can use the SemiAnalysis™️ Nvidia Matmul Benchmark for this. After running python matmul.py:

| M, N, K | bf16 torch.matmul | FP8 torch._scaled_mm (e5m2/e4m3fn) | FP8 torch._scaled_mm (e4m3) |
|---|---|---|---|
| (16384, 8192, 1280) | 224.3 TFLOPS | 448.8 TFLOPS | 443.5 TFLOPS |
| (16384, 1024, 8192) | 183.1 TFLOPS | 402.2 TFLOPS | 400.3 TFLOPS |
| (16384, 8192, 7168) | 228.1 TFLOPS | 449.6 TFLOPS | 446.5 TFLOPS |
| (16384, 3584, 8192) | 228.0 TFLOPS | 439.2 TFLOPS | 435.5 TFLOPS |
| (8192, 8192, 8192)  | 220.5 TFLOPS | 443.4 TFLOPS | 441.9 TFLOPS |

These numbers are more boring. Against spec, they’re perhaps 5% better than anticipated.
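For reference, numbers like these are usually measured by timing a pile of matmuls with CUDA events and converting to TFLOPS. A minimal sketch of that pattern (not the SemiAnalysis script itself):

```python
import torch

def bench_tflops(m, n, k, dtype=torch.bfloat16, iters=50):
    """Time torch.matmul with CUDA events and convert to TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warmup; also triggers kernel selection
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters   # elapsed_time() returns ms
    return 2 * m * n * k / seconds / 1e12

print(f"{bench_tflops(8192, 8192, 8192):.1f} TFLOPS")
```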

Why?

Reasons for divergence

A chasm of complexity exists between:

  • a script that happens to use torch.matmul frequently
  • a program which repeatedly executes matrix-multiply-add on the same tile of zeros

And so, maybe you’d wonder:

  • does the script correctly track GPU kernel timings, without any CPU overhead?
  • do the high-level APIs ultimately make use of well-optimized matmul kernels?
  • is the arithmetic intensity of typical problem sizes high enough for the 5090? (quick check below)
  • is the benchmark actually doing what it claims to do?

But the answer to all of those questions is probably 'yes'.
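On the arithmetic-intensity point specifically, a quick back-of-envelope check (assuming the commonly quoted ~1.79 TB/s memory bandwidth for the 5090, and the 209.5 TFLOPS dense bf16 spec from the table above):

```python
# Arithmetic intensity of a bf16 GEMM vs. the 5090's compute/bandwidth ratio.
def gemm_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C (ideal)
    return flops / bytes_moved

intensity = gemm_intensity(8192, 8192, 8192)   # ~2731 FLOP/byte
ridge = 209.5e12 / 1.79e12                     # ~117 FLOP/byte ridge point
print(intensity, ridge)   # intensity >> ridge, so these shapes are compute-bound
```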

The real problems are different.

Distributional differences

Going back to torch.matmul: if I simply change all tensors (and scales) to zeros, this happens:

| Shape (M, N, K) | bf16 torch.matmul | FP8 torch._scaled_mm (e5m2/e4m3fn) | FP8 torch._scaled_mm (e4m3) |
|---|---|---|---|
| (16384, 8192, 1280) | 227.9 TFLOPS | 471.7 TFLOPS | 471.8 TFLOPS |
| (16384, 1024, 8192) | 185.6 TFLOPS | 422.0 TFLOPS | 422.1 TFLOPS |
| (16384, 8192, 7168) | 238.1 TFLOPS | 487.0 TFLOPS | 486.6 TFLOPS |
| (16384, 3584, 8192) | 236.2 TFLOPS | 473.3 TFLOPS | 473.3 TFLOPS |
| (8192, 8192, 8192)  | 228.7 TFLOPS | 476.1 TFLOPS | 476.1 TFLOPS |

There is some change, but certainly nowhere near the gap required to reach 370 TFLOPS on bf16. So power usage is probably not the problem.

But then, if I make a simple edit to mmapeak, changing the initialization of the inputs from zeros to randn, the performance benefits in the half-precision cases appear to vanish:

mma_f8f8f16_16_8_32
run: 2999.0 ms 737.3 T(fl)ops
mma_f8f8f32_16_8_32
run: 3000.4 ms 370.5 T(fl)ops
mma_s8s8s32_16_16_16
run: 3003.2 ms 372.7 T(fl)ops
mma_s8s8s32_32_8_16
run: 2999.8 ms 372.6 T(fl)ops
mma_f16f16f16_16_16_16
run: 3017.1 ms 359.1 T(fl)ops
mma_f16f16f16_32_8_16
run: 3003.7 ms 358.1 T(fl)ops
mma_f16f16f32_16_16_16
run: 2975.6 ms 187.1 T(fl)ops
mma_f16f16f32_32_8_16
run: 3000.3 ms 187.0 T(fl)ops
mma_bf16bf16f32_16_16_16
run: 2999.8 ms 187.4 T(fl)ops
mma_bf16bf16f32_32_8_16
run: 2999.1 ms 187.3 T(fl)ops

So, what is happening? Is my LLM-powered CUDA edit wrong? Is torch.matmul bottlenecked by memory bandwidth?

Honestly, no idea. I ran out of time for the weekend. Feel free to continue the investigation in my stead.


Extra: MXFP8 things that don’t currently work

I was interested in getting the MXFP8 performance out of 5090s, so I tried the following:

Triton

Recently, a Block Scaled Matrix Multiplication tutorial was added to triton.

If you use the version of triton pinned to torch nightly, it will fail, due to missing imports.

If you build triton from source, it will also fail, as the compiled result is compatible with sm100 only. I lost the error logs for this after my runpod instance crashed.

CublasLt

In the latest version of CUDA, CublasLt’s block-scaled matmul routines will currently fail on sm120 devices with a CUBLAS_STATUS_NOT_SUPPORTED error.

This is why TransformerEngine recently banned non-TN fp8 GEMMs from 5090s.

Torchao

Torchao has recently implemented an MXTensor primitive to seamlessly support training with block-scaled floating-point tensors.

It has 3 compute backends:

  • emulated, which simply implements upcast + matmul (sketched below)
  • cublas, which is broken for the aforementioned reason
  • cutlass, which has sm100 hardcoded into torchao.

The last point should be fixed in the future, as there is a clear cutlass example of MXFP8 GEMMs on geforce.
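For intuition, the emulated path boils down to something like the following. This is my own minimal sketch of block-scaled dequant + matmul, not torchao's actual MXTensor code; the block size of 32 comes from the MX spec, and I keep the scales in bf16 for simplicity even though real MX scales are e8m0 (pure powers of two).

```python
import torch

BLOCK = 32  # MX formats scale groups of 32 elements along the K dimension

def mx_dequant(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Upcast fp8 data with per-32-element scales to bf16. q: (M, K), scales: (M, K//BLOCK)."""
    m, k = q.shape
    hi = q.to(torch.bfloat16).view(m, k // BLOCK, BLOCK)
    return (hi * scales.to(torch.bfloat16).unsqueeze(-1)).view(m, k)

def emulated_mx_matmul(a_q, a_scale, b_q, b_scale):
    """'Emulated' MXFP8 matmul: dequantize both operands, then a plain matmul.
    a_q: (M, K) fp8, b_q: (N, K) fp8; scales hold one value per 32-element block of K."""
    return mx_dequant(a_q, a_scale) @ mx_dequant(b_q, b_scale).t()

# Toy usage with made-up data and unit scales.
M, N, K = 128, 64, 256
a = torch.randn(M, K, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
b = torch.randn(N, K, dtype=torch.bfloat16).to(torch.float8_e4m3fn)
a_s = torch.ones(M, K // BLOCK, dtype=torch.bfloat16)
b_s = torch.ones(N, K // BLOCK, dtype=torch.bfloat16)
print(emulated_mx_matmul(a, a_s, b, b_s).shape)  # torch.Size([128, 64])
```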

torch._scaled_mm

The _scaled_mm method dispatches on the shape and dtype of the provided scale_a/scale_b:

  • scalar -> ScalingType::TensorWise
  • float8 -> ScalingType::BlockWise
  • fp32, [m,1]/[1,n] shaped -> ScalingType::RowWise

Tensor-wise and row-wise are both traditional hopper-style pure fp8 mma + dequant kernels, which means they fall into the low-performing f8 mma bucket and are therefore useless for gamers.

The block-wise matmul implementation in torch uses cublas, which is broken for the aforementioned reason.
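To make the dispatch concrete, here is roughly what tensor-wise and row-wise calls look like. This is a hedged sketch: _scaled_mm is a private API whose signature has moved around between torch versions, and it wants the second operand in column-major layout.

```python
import torch

M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn)  # row-major (N, K)...
b_t = b.t()                                                   # ...passed as column-major (K, N)

# Tensor-wise: scalar fp32 scales -> ScalingType::TensorWise
out_tw = torch._scaled_mm(
    a, b_t,
    scale_a=torch.tensor(1.0, device="cuda"),
    scale_b=torch.tensor(1.0, device="cuda"),
    out_dtype=torch.bfloat16,
)

# Row-wise: fp32 scales shaped (M, 1) and (1, N) -> ScalingType::RowWise
out_rw = torch._scaled_mm(
    a, b_t,
    scale_a=torch.ones(M, 1, device="cuda"),
    scale_b=torch.ones(1, N, device="cuda"),
    out_dtype=torch.bfloat16,
)
```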


Appendix: dependencies

For all experiments, I use a single Runpod 5090 instance, with image runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04.

After booting, I apt update, then install asciinema neovim cmake zlib1g-dev, and separately uv + micromamba.

Most of the time, the container’s pytorch packages are too old, and I install the following nightly packages:

pytorch-triton==3.3.0+git96316ce5
torch==2.8.0.dev20250412+cu128
torchao==0.11.0.dev20250413+cu128

Where appropriate, I build pytorch with a fresh micromamba python=3.11 environment.