
The latest MLPerf v5.1 AI inference benchmarks mark the debut of new flagship chips from NVIDIA and AMD: the Blackwell Ultra GB300 and the Instinct MI355X. Both accelerators posted strong results that are drawing considerable attention in the tech community.
NVIDIA Blackwell Ultra GB300 & AMD Instinct MI355X: A New Benchmark in AI Performance
MLCommons recently released its latest evaluation of AI performance through the MLPerf v5.1 benchmarks, revealing remarkable submissions, notably from NVIDIA and AMD. The Blackwell Ultra GB300 and Instinct MI355X stand out as the premier offerings in AI technology from their respective manufacturers. In this analysis, we will closely examine their capabilities as demonstrated through the benchmarks.
Blackwell Ultra GB300 Performance Highlights
In the DeepSeek R1 (Offline) category, NVIDIA’s GB300 dramatically outpaces its predecessor, the GB200, achieving a 45% performance increase in 72-GPU setups and a 44% boost in an 8-GPU configuration. These improvements align closely with NVIDIA’s projected performance gains.
In the DeepSeek R1 (Server) category, the Blackwell Ultra GB300 has also made notable strides, with a 25% performance increase in 72-GPU configurations and a 21% boost with 8 GPUs.
AMD’s Instinct MI355X Enters the Arena
The AMD Instinct MI355X has also made substantial contributions, particularly in the Llama 3.1 405B (Offline) benchmarks. A comparative evaluation against the GB200 revealed a remarkable 27% performance increase, demonstrating AMD’s advancements in the AI sector.
Moreover, in a benchmark involving Llama 2 70B (Offline), the MI355X showcased impressive throughput, generating up to 648,248 tokens per second in a 64-chip configuration and a striking 2.09x performance increase over the NVIDIA GB200 in an 8-chip setup.
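As a rough cross-check of the aggregate figure, the sketch below simply divides the reported 648,248 tokens per second by the 64 chips in the configuration to get an approximate per-accelerator rate. The numbers come straight from the result quoted above; the script itself is purely illustrative.

```python
# Back-of-the-envelope arithmetic using the Llama 2 70B (Offline) result above.
total_tokens_per_second = 648_248   # reported aggregate throughput, 64x MI355X
num_chips = 64

per_chip = total_tokens_per_second / num_chips
print(f"~{per_chip:,.0f} tokens/second per MI355X")  # ~10,129 tokens/second
```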
NVIDIA has shared a detailed analysis of their benchmarks, including the various records achieved through the Blackwell Ultra GB300 platform. These results showcase a significant advancement in AI inference capabilities.

Comprehensive Record Table
MLPerf Inference Per-Accelerator Records

| Benchmark | Offline | Server | Interactive |
| --- | --- | --- | --- |
| DeepSeek-R1 | 5,842 tokens/second/GPU | 2,907 tokens/second/GPU | ** |
| Llama 3.1 405B | 224 tokens/second/GPU | 170 tokens/second/GPU | 138 tokens/second/GPU |
| Llama 2 70B 99.9% | 12,934 tokens/second/GPU | 12,701 tokens/second/GPU | 7,856 tokens/second/GPU |
| Llama 2 70B 99% | 13,015 tokens/second/GPU | 12,701 tokens/second/GPU | 7,856 tokens/second/GPU |
| Llama 3.1 8B | 18,370 tokens/second/GPU | 16,099 tokens/second/GPU | 15,284 tokens/second/GPU |
| Stable Diffusion XL | 4.07 samples/second/GPU | 3.59 queries/second/GPU | ** |
| Mixtral 8x7B | 16,099 tokens/second/GPU | 16,131 tokens/second/GPU | ** |
| DLRMv2 99% | 87,228 samples/second/GPU | 80,515 samples/second/GPU | ** |
| DLRMv2 99.9% | 48,666 samples/second/GPU | 46,259 queries/second/GPU | ** |
| Whisper | 5,667 tokens/second/GPU | ** | ** |
| R-GAT | 81,404 samples/second/GPU | ** | ** |
| Retinanet | 1,875 samples/second/GPU | 1,801 queries/second/GPU | ** |
Furthermore, NVIDIA’s Blackwell Ultra has set new reasoning benchmarks at MLPerf, outperforming the previous Hopper architecture by 4.7x in offline mode and 5.2x in server configurations, a substantial leap in efficiency. The table below breaks down the comparison, and a short worked calculation follows it.
DeepSeek-R1 Performance Comparison

| Architecture | Offline | Server |
| --- | --- | --- |
| Hopper | 1,253 tokens/second/GPU | 556 tokens/second/GPU |
| Blackwell Ultra | 5,842 tokens/second/GPU | 2,907 tokens/second/GPU |
| Blackwell Ultra Advantage | 4.7x | 5.2x |
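The advantage row is just the ratio of the two per-GPU throughputs. The short snippet below recomputes the multipliers from the table values, purely as a worked example.

```python
# Recompute the Blackwell Ultra vs. Hopper multipliers from the DeepSeek-R1
# per-GPU throughputs listed in the table above (tokens/second/GPU).
hopper = {"Offline": 1_253, "Server": 556}
blackwell_ultra = {"Offline": 5_842, "Server": 2_907}

for scenario, baseline in hopper.items():
    speedup = blackwell_ultra[scenario] / baseline
    print(f"{scenario}: {speedup:.1f}x")  # Offline: 4.7x, Server: 5.2x
```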
As we look forward to future MLPerf submissions, it’s anticipated that NVIDIA, AMD, and Intel will continue to enhance their platforms, striving for even greater performance levels in this competitive landscape.