NVIDIA Blackwell Ultra Advances Agentic AI Performance: Achieving 50x Higher Tokens/Watt and Enhanced Long-Context Workloads

NVIDIA has introduced its latest computing platform for hyperscalers: Blackwell Ultra. Recent benchmarks of the GB300 NVL72 show strong performance, particularly in low-latency and long-context workloads.

NVIDIA’s Blackwell Ultra AI Racks: Enhanced Agentic Performance Through NVLink Advances

The AI landscape has changed considerably since its surge in 2022, with growing emphasis on agentic computing driven by new applications and frameworks. For infrastructure providers like NVIDIA, high memory bandwidth and compute performance are critical to meeting the strict latency demands of these systems, and the Blackwell Ultra series is built to address them. In results NVIDIA recently shared in a blog post, Blackwell Ultra posted strong numbers on SemiAnalysis's InferenceMAX benchmark.

A line graph titled 'DeepSeekR1 Throughput per MW' shows the GB300 NVL72 NVFP4 achieving significantly higher token throughput.

NVIDIA highlights a crucial metric, "tokens per watt," which has become central to hyperscaler buildouts. The focus is on throughput delivered per unit of power rather than raw performance alone: the GB300 NVL72 achieves a 50-fold increase in throughput per megawatt compared with the previous-generation Hopper GPUs. The comparison uses the optimal "deployed state" of each architecture.
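To make the metric concrete, here is a minimal Python sketch of how throughput per megawatt and a fold improvement can be computed. The function names and every number below are hypothetical placeholders chosen only so the arithmetic is easy to follow; they are not NVIDIA's or SemiAnalysis's measurements, and the article does not describe how the benchmark itself normalizes power.

```python
def tokens_per_second_per_megawatt(tokens_per_second: float, power_watts: float) -> float:
    """Normalize throughput by power draw, giving tokens/s per megawatt."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    # 1 MW = 1,000,000 W, so scale the per-watt figure up by one million.
    return tokens_per_second * 1_000_000 / power_watts


def fold_improvement(new_value: float, baseline_value: float) -> float:
    """Ratio of a new measurement to a baseline (e.g. 50.0 means 50x)."""
    if baseline_value <= 0:
        raise ValueError("baseline_value must be positive")
    return new_value / baseline_value


# Hypothetical placeholder inputs, NOT measured data.
baseline = tokens_per_second_per_megawatt(tokens_per_second=10_000, power_watts=100_000)
candidate = tokens_per_second_per_megawatt(tokens_per_second=600_000, power_watts=120_000)

print(f"baseline:  {baseline:,.0f} tokens/s per MW")
print(f"candidate: {candidate:,.0f} tokens/s per MW")
print(f"improvement: {fold_improvement(candidate, baseline):.1f}x")
```

Note that the candidate system in this toy example draws more power than the baseline yet still comes out ahead, because the metric rewards throughput gained per watt spent rather than absolute power draw.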

How does NVIDIA achieve such throughput gains? Much of the answer lies in NVLink. Blackwell Ultra joins 72 GPUs into a single NVLink fabric offering 130 TB/s of connectivity. Hopper, by contrast, uses an 8-chip NVLink design, which is effective but cannot match the scale of the Blackwell Ultra layout. The introduction of the NVFP4 precision format is also pivotal to the GB300's throughput lead.

A partially open server rack displays NVIDIA hardware components and cabling inside. — Image Credits: NVIDIA

With the rise of "agentic AI," NVIDIA's assessments of the GB300 NVL72 also emphasize token costs alongside the upgrades above. NVIDIA reports a 35-fold decrease in the cost per million tokens, which it presents as making this system the leading choice for inference among frontier labs and hyperscalers. NVIDIA attributes these gains to its "extreme co-design" strategy, alongside what has come to be known as Huang's Law.
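As a rough illustration of what a cost-per-million-tokens figure expresses, the sketch below uses a deliberately simple model: the hourly cost of running a system divided by the tokens it produces in an hour. This model, the function name, and all input values are assumptions made for illustration; the article does not state how NVIDIA or InferenceMAX derive their cost figures.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens under a simple hourly-cost model."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical placeholder inputs, NOT measured data.
old_cost = cost_per_million_tokens(hourly_cost_usd=3.6, tokens_per_second=1_000)
new_cost = cost_per_million_tokens(hourly_cost_usd=7.2, tokens_per_second=70_000)

print(f"old system: ${old_cost:.4f} per million tokens")
print(f"new system: ${new_cost:.4f} per million tokens")
print(f"reduction:  {old_cost / new_cost:.1f}x")
```

Under this model, a system that costs twice as much per hour can still be far cheaper per token if its throughput rises by a much larger factor.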

A line graph titled 'GB300 NVL72 Delivers Large Leap for Long Context AI' shows GB300 NVL72 achieving a 1.5x lower cost per token.

The comparison between the GB300 NVL72 and Hopper spans different compute node configurations and architectural designs, so NVIDIA has also compared the GB200 against the GB300 NVL72 on long-context workloads. Context limits remain a significant consideration for agents, since working across a large codebase can sharply increase token usage. With Blackwell Ultra, NVIDIA reports up to 1.5 times lower cost per token and 2 times faster attention processing, which suits it to agent-centric tasks.

As Blackwell Ultra begins rolling out in hyperscaler environments, these benchmarks are among the earliest evaluations of the architecture. Initial results suggest that NVIDIA has kept performance scaling in step with contemporary AI applications. With Vera Rubin expected to follow, the Blackwell generation may extend NVIDIA's lead in the competitive infrastructure landscape.
