NVIDIA Blackwell Ultra GB300 AI Racks: Dominating Long-Context DeepSeek Workloads with Superior Performance Over GB200

Recent benchmarks of NVIDIA’s GB300 NVL72 AI racks running DeepSeek’s latest open-source models show strong results once fine-tuning and optimized inference strategies are applied.

NVIDIA’s Blackwell Ultra Outperforms GB200 NVL72 in Latency-Sensitive Tasks

NVIDIA designed the GB300 primarily for long-context performance, positioning it to benefit from the growing demand for agentic AI. A prior analysis found that Blackwell Ultra delivers a 50x increase in throughput per megawatt over the earlier Hopper GPUs, a gain attributed to hardware-software co-design. The Large Model Systems Organization (LMSYS) has now tested long-context inference on the platform, with encouraging results. The testing also relies on infrastructure-level software routing, discussed below.

Long-context workloads put heavy pressure on GPU memory (VRAM). To address this, the LMSYS team used PD (Prefill-Decode) Disaggregation, a strategy for handling long token contexts across multiple compute nodes. The prefill phase processes the prompt, and the decode phase generates output tokens. Running the two phases on distinct hardware keeps one from stalling the other, which reduces bottlenecks and raises overall throughput at scale.
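The split between the two phases can be illustrated with a toy sketch. This is not the LMSYS implementation: the class names (`PrefillWorker`, `DecodeWorker`), the `serve` function, the pool sizes, and the stand-in "model" are all hypothetical, and the KV cache is represented by a plain list rather than real tensors.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)      # stand-in for KV tensors
    output_tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Processes the whole prompt once and builds the KV cache."""

    def run(self, req: Request) -> Request:
        # Toy stand-in: one cache entry per prompt token.
        req.kv_cache = list(req.prompt_tokens)
        return req


class DecodeWorker:
    """Generates tokens one at a time, reusing the transferred KV cache."""

    def run(self, req: Request) -> Request:
        for _ in range(req.max_new_tokens):
            # Toy "model": the next token is the current cache length.
            next_token = len(req.kv_cache)
            req.output_tokens.append(next_token)
            req.kv_cache.append(next_token)  # cache grows with each decoded token
        return req


def serve(requests: list[Request]) -> list[Request]:
    # Hypothetical pool sizes; real deployments tune the prefill/decode ratio.
    prefill_pool = [PrefillWorker(), PrefillWorker()]
    decode_pool = [DecodeWorker()]
    handoff: deque[Request] = deque()  # models the KV transfer between pools

    # Phase 1: prompts are spread round-robin over the prefill workers.
    for i, req in enumerate(requests):
        handoff.append(prefill_pool[i % len(prefill_pool)].run(req))

    # Phase 2: decode workers pick up requests whose cache is ready.
    finished = []
    while handoff:
        req = handoff.popleft()
        finished.append(decode_pool[req.request_id % len(decode_pool)].run(req))
    return finished
```

In a real system the two pools run concurrently on different nodes and the handoff is a network transfer of KV tensors. The sketch runs them sequentially only to show which state crosses the boundary.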

A bar chart titled 'GB300 vs GB200: Max TPS/GPU' shows GB300 outperforming GB200 with 226.2 TPS/GPU when MTP is off. Image Credits: LMSYS

In addition to PD Disaggregation, the LMSYS team applied several other optimizations, including dynamic chunking to improve prompt response times in long-context scenarios and making effective use of key-value (KV) cache capacity. The primary metrics evaluated were throughput, capacity, and latency ratios.
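The article does not describe how LMSYS sizes its chunks. As an illustration only, the sketch below assumes a hypothetical policy in which a long prompt is prefilled in pieces and each piece gets smaller as the already-processed prefix grows, since later chunks must attend to more prior tokens. The function name `plan_chunks` and the parameters `base_chunk` and `min_chunk` are made up for this example.

```python
def plan_chunks(prompt_len: int, base_chunk: int = 8192,
                min_chunk: int = 1024) -> list[tuple[int, int]]:
    """Split a prompt of prompt_len tokens into (start, end) prefill chunks.

    Hypothetical policy (not from LMSYS): the chunk size shrinks as the
    already-processed prefix grows, so size * (prefix + base_chunk)
    stays roughly constant until the min_chunk floor is reached.
    """
    if prompt_len < 0 or base_chunk <= 0 or min_chunk <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk sizes must be > 0")
    chunks = []
    start = 0
    while start < prompt_len:
        size = max(min_chunk, base_chunk * base_chunk // (base_chunk + start))
        end = min(start + size, prompt_len)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    for start, end in plan_chunks(20_000):
        print(f"prefill tokens [{start}, {end}): {end - start} tokens")
```

A fixed chunk size would be simpler. The shrinking schedule is one way to keep per-chunk work more even on very long prompts, at the cost of more scheduling steps.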

Comparison of NVIDIA’s GB300 NVL72 and GB200 NVL72

1.53x Increased Peak Throughput: 226.2 TPS/GPU (Tokens Per Second)
1.87x Enhanced User Speed: Substantial rise in TPS/User thanks to MTP (Multi-Token Prediction).
1.58x Improvement in Latency: Notable reduction in latency metrics.
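
To put the ratios above in more concrete terms, the short calculation below derives what they imply. It assumes the 1.53x figure is GB300 peak throughput divided by GB200 peak throughput, and that the 1.58x latency figure is GB200 latency divided by GB300 latency. The resulting values are back-of-the-envelope estimates, not numbers published by LMSYS, and the function names are illustrative.

```python
GB300_PEAK_TPS_PER_GPU = 226.2  # quoted in the comparison above
PEAK_THROUGHPUT_RATIO = 1.53    # assumed to mean GB300 / GB200
LATENCY_RATIO = 1.58            # assumed to mean GB200 latency / GB300 latency


def implied_baseline(new_value: float, speedup: float) -> float:
    """Baseline value implied by a new value and a reported speedup."""
    return new_value / speedup


def latency_reduction_pct(speedup: float) -> float:
    """Percentage latency reduction implied by an 'Nx faster' ratio."""
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    gb200 = implied_baseline(GB300_PEAK_TPS_PER_GPU, PEAK_THROUGHPUT_RATIO)
    print(f"Implied GB200 peak: {gb200:.1f} TPS/GPU")
    print(f"Implied latency reduction: {latency_reduction_pct(LATENCY_RATIO):.1f}%")
```

Under these assumptions, the GB200 peak works out to roughly 147.8 TPS/GPU, and a 1.58x latency improvement corresponds to about a 36.7% reduction in latency.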

The LMSYS findings indicate that the GB300 consistently holds a 1.4x to 1.5x advantage over the GB200, particularly in latency-sensitive scenarios. That fits the increasing focus on agentic workloads, for which Blackwell Ultra appears well suited. However, comprehensive Total Cost of Ownership (TCO) figures have not been disclosed, which matters given the rising deployment costs associated with the GB300.

A partially open server rack displays NVIDIA hardware components and cabling inside. — Image Credits: NVIDIA

NVIDIA’s approach pairs architectural advances with solutions to workload-specific challenges. For Blackwell Ultra, the latency improvements strengthen its position among hyperscalers and neocloud providers serving the agentic AI sector.
