NVIDIA Blackwell Ultra GB300 AI Racks Excel in Long-Context DeepSeek Workloads, Outperforming GB200

NVIDIA has recently put its GB300 NVL72 AI racks through their paces using DeepSeek’s newest open-source models. The results, which follow extensive fine-tuning and inference optimization, are promising.

NVIDIA’s Blackwell Ultra Achieves Up to 1.5x Advantage Over GB200 NVL72 in Latency-Sensitive Tasks

NVIDIA’s latest innovation, the GB300 series, is designed for long-context performance and targets the growing demand for agentic AI. As previously discussed, the Blackwell Ultra architecture delivers a 50x increase in throughput per megawatt over the earlier Hopper GPUs, a gain attributed to a co-design strategy. Recent testing by the Large Model Systems Organization (LMSYS) showcased the GB300 NVL72’s long-context inference capabilities, with encouraging results. The testing also involved infrastructure-level software routing, which is covered below.

Long-context workloads lean more heavily on GPU VRAM. To address this, the LMSYS team used a technique known as PD (Prefill-Decode) Disaggregation, which distributes the work across separate hardware nodes to prevent bottlenecks. The prefill phase handles prompt processing and the decode phase handles token generation; running the two phases on separate nodes results in higher throughput at scale.
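The idea behind PD disaggregation can be illustrated with a toy router. This is a minimal sketch, not the LMSYS implementation: the class names (`PrefillWorker`, `DecodeWorker`, `DisaggregatedRouter`), the round-robin policy, and the list standing in for a KV cache are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Request:
    request_id: int
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)
    output_tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Hypothetical node that only processes prompts."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handled = 0

    def prefill(self, req: Request) -> Request:
        # Stand-in for prompt processing: store one cache entry per prompt token.
        req.kv_cache = list(req.prompt_tokens)
        self.handled += 1
        return req


class DecodeWorker:
    """Hypothetical node that only generates tokens."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.handled = 0

    def decode(self, req: Request) -> Request:
        # Stand-in for generation: each step reads the cache, emits one
        # token, and appends that token to the cache.
        for _ in range(req.max_new_tokens):
            next_token = sum(req.kv_cache) % 1000
            req.output_tokens.append(next_token)
            req.kv_cache.append(next_token)
        self.handled += 1
        return req


class DisaggregatedRouter:
    """Sends each request to a prefill node, then hands it to a decode node."""

    def __init__(self, prefill_workers, decode_workers) -> None:
        # Round-robin is an assumed policy; a real router would likely
        # weigh load and cache locality.
        self._prefill = cycle(prefill_workers)
        self._decode = cycle(decode_workers)

    def serve(self, req: Request) -> Request:
        req = next(self._prefill).prefill(req)  # phase 1: prompt processing
        return next(self._decode).decode(req)   # phase 2: token generation
```

Because the two pools are separate, each can be sized independently. In a real deployment the handoff between phases would involve moving the KV cache between nodes, which this sketch reduces to passing a Python object.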

A bar chart titled 'GB300 vs GB200: Max TPS/GPU' shows GB300 outperforming GB200 with 226.2 TPS/GPU when MTP is off. Image Credits: LMSYS

The LMSYS team also implemented several optimizations, including dynamic chunking for prompt processing tuned to long-context settings, along with effective KV capacity translation. The key performance indicators observed during testing include:

Comparative Analysis: NVIDIA’s GB300 NVL72 vs. GB200 NVL72

Peak Throughput: 1.53x advantage with 226.2 TPS/GPU (Tokens Per Second)
User Speed Improvement: 1.87x increase in TPS/User due to Multi-Token Prediction (MTP)
Latency Enhancement: 1.58x lower latency
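To put the ratios above in perspective, the arithmetic can be sketched as follows. The derived values are not reported figures: the GB200 baseline assumes the 1.53x ratio is GB300 peak divided by GB200 peak under the same settings, and the latency reduction assumes "1.58x lower" means GB300 latency equals GB200 latency divided by 1.58. The function names are hypothetical.

```python
def implied_baseline(new_value: float, speedup: float) -> float:
    """Baseline value implied by a reported value and its speedup ratio."""
    return new_value / speedup


def latency_reduction(factor: float) -> float:
    """Fractional latency reduction implied by an 'N x lower latency' claim."""
    return 1.0 - 1.0 / factor


# Figures taken from the list above.
gb300_peak_tps_per_gpu = 226.2
peak_speedup = 1.53
latency_factor = 1.58

# Derived under the stated assumptions, not measured.
gb200_peak_tps_per_gpu = implied_baseline(gb300_peak_tps_per_gpu, peak_speedup)
reduction = latency_reduction(latency_factor)

print(f"Implied GB200 peak: {gb200_peak_tps_per_gpu:.1f} TPS/GPU")
print(f"Implied latency reduction: {reduction:.1%}")
```

Under those assumptions, the implied GB200 peak is roughly 148 TPS/GPU, and the latency claim corresponds to a reduction of roughly 37%.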

The evaluations indicate that the GB300 maintains a 1.4x to 1.5x lead over the GB200, particularly in latency-critical situations. Given its focus on agentic applications, the Blackwell Ultra architecture is positioned as a strategic choice for high-performance workloads. However, industry discussion of Total Cost of Ownership (TCO) has yet to surface, which matters because deployment costs for GB300 have risen as well.

A partially open server rack displays NVIDIA hardware components and cabling inside — Image Credits: NVIDIA

NVIDIA’s strategy combines architectural innovation with attention to specific industry challenges. In particular, the latency improvements of the Blackwell Ultra architecture make it an attractive option for hyperscalers and neoclouds running agentic workloads.
