Last week, I had temporary remote access to a node with 4090 48GB GPUs.
As far as I can tell: they work as expected with 0 issues or gotchas. In particular,
- no special drivers/software are needed. Standard nvidia drivers work OOTB.
- the memory is real; host2device->device2host copies >24GB are accurate.
- perf matches expectations. The FLOPS/membw available is similar to a normal 4090.
- I additionally tested Flux LoRA training via SimpleTuner, which had results identical to a 6000Ada run. Unfortunately I was unable to record this information clearly before running out of time.
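A minimal sketch of how a ">24GB round-trip" check could look in PyTorch. This is not the exact test I ran; the function name `roundtrip_ok` and its parameters are hypothetical. It keeps every chunk resident on the device at the same time, so that memory past the 24GB mark cannot silently alias earlier memory, then replays the same seeded random stream on the host to compare. The requested size is rounded up to a whole number of chunks.

```python
import torch


def roundtrip_ok(total_bytes: int, chunk_bytes: int = 1 << 30, device: str = "cuda") -> bool:
    """Fill `device` with seeded random bytes, then read them back and compare."""
    n_chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    gen = torch.Generator().manual_seed(0)

    # Keep every chunk resident on the device so the whole range is live at once.
    resident = []
    for _ in range(n_chunks):
        host = torch.randint(0, 256, (chunk_bytes,), dtype=torch.uint8, generator=gen)
        resident.append(host.to(device))

    # Replay the same random stream on the host and compare chunk by chunk.
    gen.manual_seed(0)
    for dev_chunk in resident:
        expected = torch.randint(0, 256, (chunk_bytes,), dtype=torch.uint8, generator=gen)
        if not torch.equal(dev_chunk.cpu(), expected):
            return False
    return True
```

On a 48GB card, something like `roundtrip_ok(40 * 2**30)` would exercise well past the 24GB of a stock 4090 while only ever holding one 1GiB chunk in host memory.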
Experiments#
As I was ill-prepared for the situation, I only managed to scramble for a few experiments before I ran out of time:
When I receive a physical copy (I am ordering some), I will update this post with proper tests, in accordance with what SemiAnalysis uses in their recent AMD benchmarking article.
ml-engineering#
Stas Bekman’s delightful ml-engineering repository provides an abundance of information, but for the purpose of validating a 4090, I only really care about their MAMF tests.
MAMF#
According to the Ada whitepaper, a standard 4090 should peak at 165.2 TFLOPS on an HGEMM. According to mamf-finder.py, the 48GB 4090 tested reaches ~103% of that:
Ending:
The best outcome was 170.3TFLOPS @ 10240x4096x4096 (MxNxK) (tried 79 shapes)
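The ~103% figure is just the ratio of the two numbers above. The sketch below redoes that arithmetic and, assuming the usual 2·M·N·K FLOP count for a dense GEMM, estimates how long one matmul of the winning shape takes at the measured rate. The helper names are mine, not part of mamf-finder.py.

```python
WHITEPAPER_PEAK_TFLOPS = 165.2   # quoted above from the Ada whitepaper
MEASURED_TFLOPS = 170.3          # best result reported by mamf-finder.py


def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for one dense (M x K) @ (K x N) matmul: one multiply + one add per MAC."""
    return 2 * m * n * k


def fraction_of_peak(measured: float, peak: float) -> float:
    return measured / peak


ratio = fraction_of_peak(MEASURED_TFLOPS, WHITEPAPER_PEAK_TFLOPS)
flops = gemm_flops(10240, 4096, 4096)
# Time for a single matmul of the best shape at the measured throughput.
millis = flops / (MEASURED_TFLOPS * 1e12) * 1e3

print(f"{ratio:.1%} of whitepaper peak, ~{millis:.2f} ms per matmul")
```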
all_reduce_bench#
Since I had a 7x node, I tested all_reduce_bench.py for fun:
The average bandwidth of all_reduce with a 4.0GB payload (5 trials, 7 ranks):
algbw: 9.897 GBps (79.2 Gbps)
busbw: 16.967 GBps (135.7 Gbps)
Doing this benchmark was pointless, as the node did not have the Tinygrad P2P Driver Patch installed.
Although I did not test installation of the P2P patch, I see no reason to expect it to fail, given that the node itself uses standard nvidia drivers.
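For readers unfamiliar with the two bandwidth figures printed above: assuming the benchmark follows the nccl-tests convention, busbw is algbw scaled by 2(n-1)/n for an all-reduce over n ranks. A small sketch that reproduces the reported numbers to within rounding (function names are mine):

```python
def allreduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce, per the nccl-tests convention (assumed here)."""
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks


def gigabytes_to_gigabits(gb_per_s: float) -> float:
    return gb_per_s * 8


algbw = 9.897                      # GBps, from the output above
busbw = allreduce_busbw(algbw, 7)  # 7 ranks, from the output above

print(f"algbw: {algbw:.3f} GBps ({gigabytes_to_gigabits(algbw):.1f} Gbps)")
print(f"busbw: {busbw:.3f} GBps ({gigabytes_to_gigabits(busbw):.1f} Gbps)")
```

The computed busbw differs from the logged 16.967 GBps only in the last digit, because the logged algbw is itself rounded.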
gpt-fast#
After testing FLOPs, the next goal was to test for memory bandwidth.
Batch size 1 LLM decoding throughput is well-known to be memory bandwidth bound, so I made use of gpt-fast on the latest stable PyTorch to model a reasonably popular membw-bound workload.
Due to the horrible network speeds of the node I was using, I was unable to directly download the weights for any model.
So, I used this script to generate fresh weights locally:
import sys
from pathlib import Path

import torch

# from gpt-fast/model.py
from model import Transformer, ModelArgs

# e.g. "Meta-Llama-3-8B"
#      "checkpoints/meta-llama/Meta-Llama-3-8B/model.pth"
model_name, fpath = sys.argv[1:]

torch.set_default_dtype(torch.bfloat16)
# Build the model directly on the GPU with freshly initialised (random) weights
with torch.device('cuda'):
    model = Transformer(ModelArgs.from_name(model_name))

# Make sure the checkpoint directory exists before saving
Path(fpath).parent.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), fpath)
As some may know, different input data distributions can produce slightly different speeds for the same operations.
Although I believe the performance gap between random kaiming uniform tensors and true ‘organic’ AdamW optimized tensors should be insignificant, smart readers may want to add their own confidence intervals to the results.
I executed the following commands:
$ python generate.py --compile --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B/model.pth --prompt "Hello, my name is"
Using device=cuda
Loading model ...
Time to load model: 2.78 seconds
Compilation time: 64.74 seconds
Time for inference 1: 3.29 sec total, 60.99 tokens/sec
Bandwidth achieved: 915.50 GB/s
FLOPS achieved: 0.94 TF/s
Time for inference 2: 3.28 sec total, 61.04 tokens/sec
Bandwidth achieved: 916.16 GB/s
FLOPS achieved: 0.94 TF/s
Time for inference 3: 3.28 sec total, 61.00 tokens/sec
Bandwidth achieved: 915.64 GB/s
FLOPS achieved: 0.94 TF/s
Time for inference 4: 3.29 sec total, 61.02 tokens/sec
Bandwidth achieved: 915.95 GB/s
FLOPS achieved: 0.94 TF/s
Time for inference 5: 3.28 sec total, 61.02 tokens/sec
Bandwidth achieved: 915.95 GB/s
FLOPS achieved: 0.94 TF/s
==========
Batch Size: 1
Prompt Length: 6
Generated tokens: 200
Average tokens/sec: 61.01
Memory used: 16.43 GB
and for 13b:
$ python generate.py --compile --checkpoint_path checkpoints/meta-llama/Llama-2-13b-chat-hf/model.pth --prompt "Hello, my name is"
...
Time for inference 5: 5.52 sec total, 36.22 tokens/sec
Bandwidth achieved: 931.01 GB/s
FLOPS achieved: 0.96 TF/s
==========
Batch Size: 1
Prompt Length: 6
Generated tokens: 200
Average tokens/sec: 36.22
Memory used: 26.67 GB
Notably, this beats the 6000Ada on the same task:
Time for inference 5: 6.04 sec total, 33.09 tokens/sec
Bandwidth achieved: 850.45 GB/s
FLOPS achieved: 0.88 TF/s
==========
Batch Size: 1
Prompt Length: 6
Generated tokens: 200
Average tokens/sec: 32.45
Memory used: 26.67 GB
For a local LLM user, there is almost no value in purchasing a 6000Ada, or even the outdated A6000, over a 4090 48GB.
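To make the numbers above easier to compare, here is a rough sketch of the arithmetic behind them. It assumes gpt-fast reports "Bandwidth achieved" as (bytes of weights read per token) × (tokens/sec), which is my reading of the tool rather than something verified here. The 1008 GB/s peak is the spec-sheet figure for a stock 4090, not a number measured in this post.

```python
PEAK_4090_GBS = 1008.0  # assumption: spec-sheet memory bandwidth of a stock 4090


def implied_weight_gb(bandwidth_gbs: float, tokens_per_sec: float) -> float:
    """GB of weights streamed per generated token, implied by the logged figures."""
    return bandwidth_gbs / tokens_per_sec


# (bandwidth GB/s, tokens/sec) taken from the 4090 48GB logs above
llama3_8b = (915.50, 60.99)
llama2_13b = (931.01, 36.22)

gb_8b = implied_weight_gb(*llama3_8b)
gb_13b = implied_weight_gb(*llama2_13b)
peak_fraction_13b = llama2_13b[0] / PEAK_4090_GBS

# Average tokens/sec on the 13B model: 4090 48GB vs 6000Ada
speedup_vs_6000ada = 36.22 / 32.45

print(f"8B:  ~{gb_8b:.2f} GB read per token")
print(f"13B: ~{gb_13b:.2f} GB read per token, {peak_fraction_13b:.0%} of assumed peak")
print(f"13B speedup over 6000Ada: {speedup_vs_6000ada:.2f}x")
```

The implied per-token read is a little below the "Memory used" lines, which is expected: that figure also covers allocations that are not streamed on every decoding step.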
flux-fp8-api#
At the insistence of a collaborator, I tested the performance of the 4090 48GB on Flux.1-dev inference as well.
I made use of flux-fp8-api for this, as I also wanted to test FP8 performance, to rule out the possibility of the ‘4090’ being a faked rename of a pre-Ada GPU.
In testing, a modified version of the 6000Ada inference config was used:
{
"version": "flux-dev",
"params": {
"in_channels": 64,
"vec_in_dim": 768,
"context_in_dim": 4096,
"hidden_size": 3072,
"mlp_ratio": 4.0,
"num_heads": 24,
"depth": 19,
"depth_single_blocks": 38,
"axes_dim": [
16,
56,
56
],
"theta": 10000,
"qkv_bias": true,
"guidance_embed": true
},
"ae_params": {
"resolution": 256,
"in_channels": 3,
"ch": 128,
"out_ch": 3,
"ch_mult": [
1,
2,
4,
4
],
"num_res_blocks": 2,
"z_channels": 16,
"scale_factor": 0.3611,
"shift_factor": 0.1159
},
"ckpt_path": "/big/generator-ui/flux-testing/flux/model-dir/flux1-dev.sft",
"ae_path": "/big/generator-ui/flux-testing/flux/model-dir/ae.sft",
"repo_id": "black-forest-labs/FLUX.1-dev",
"repo_flow": "flux1-dev.sft",
"repo_ae": "ae.sft",
"text_enc_max_length": 512,
"text_enc_path": "city96/t5-v1_1-xxl-encoder-bf16",
"text_enc_device": "cuda:0",
"ae_device": "cuda:0",
"flux_device": "cuda:0",
"flow_dtype": "bfloat16",
"ae_dtype": "bfloat16",
"text_enc_dtype": "bfloat16",
"flow_quantization_dtype": "bfloat16",
"text_enc_quantization_dtype": "bfloat16",
"compile_extras": true,
"compile_blocks": true,
"offload_text_encoder": false,
"offload_vae": false,
"offload_flow": false
}
Note: qfloat8 quantization was replaced with bfloat16 to avoid invoking Quanto/torch-cublas-hgemm, which require nvcc for compilation; I lacked the prudence and time to install it. Internally, flux-fp8-api will still use fp8 GEMM by downcasting bfloat16 weights appropriately at inference time.
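The config change described in the note can be sketched as a small helper. This is illustrative only: `disable_quanto` is a hypothetical name, and the assumption that the stock config sets both `*_quantization_dtype` keys to `qfloat8` is mine.

```python
import copy
import json

# The two keys that differ from the stock config, per the note above.
QUANT_KEYS = ("flow_quantization_dtype", "text_enc_quantization_dtype")


def disable_quanto(config: dict, dtype: str = "bfloat16") -> dict:
    """Return a copy of `config` with the quantization dtypes swapped to `dtype`."""
    patched = copy.deepcopy(config)
    for key in QUANT_KEYS:
        if key in patched:
            patched[key] = dtype
    return patched


# Assumed stock values; only the relevant keys are shown.
stock = {
    "flow_dtype": "bfloat16",
    "flow_quantization_dtype": "qfloat8",
    "text_enc_quantization_dtype": "qfloat8",
}
print(json.dumps(disable_quanto(stock), indent=2))
```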
For the “✅ compile blocks & extras” config described in the README, these are the inference it/s, compared with standard GPU results:
Resolution | 4090 48G | 4090 | 6000Ada
---|---|---|---
1024x1024 | 3.36 | 3.51 | 2.8
1024x720 | 4.71 | 4.96 | 3.78
Our 48GB is a little bit slower due to the aforementioned qfloat8->bfloat16 swap, which incurs some membw overhead.
The 6000Ada loses due to extreme power throttling. There is no public resource on this, but you can find a similar reported issue for the L4 GPU.
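To put "a little bit slower" into numbers, the it/s figures from the table can be turned into ratios. A quick sketch using only those values:

```python
# it/s from the table above: (4090 48G, 4090, 6000Ada)
results = {
    "1024x1024": (3.36, 3.51, 2.8),
    "1024x720": (4.71, 4.96, 3.78),
}

relative = {}
for resolution, (mod_48g, stock_4090, ada_6000) in results.items():
    relative[resolution] = (mod_48g / stock_4090, mod_48g / ada_6000)

for resolution, (vs_4090, vs_6000) in relative.items():
    print(f"{resolution}: {vs_4090:.1%} of a stock 4090, {vs_6000:.2f}x the 6000Ada")
```

That works out to roughly a 4-5% deficit against the stock 4090 and a 20-25% lead over the 6000Ada.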
Conclusion#
There is at least one 4090 48GB on planet earth that performs in accordance with expectations.
Note: As a financially-interested NVD3.L shareholder, I have no interest in sabotaging the datacenter revenue of Nvidia Corporation. Therefore, I will not provide public links indicating where to purchase these devices.