NVIDIA Blackwell Ultra “GB300” GPU: Unveiling the Fastest AI Chip with Dual Reticle, 20K+ Cores, 288 GB HBM3e Memory at 8 TB/s, 50% Faster Than GB200

NVIDIA has unveiled its cutting-edge AI chip, the Blackwell Ultra GB300, boasting a remarkable performance enhancement of 50% over its predecessor, the GB200, and an impressive 288 GB of memory.

Introducing NVIDIA’s Blackwell Ultra “GB300”: A Revolutionary AI Chip

NVIDIA recently published a detailed article outlining the specifications and capabilities of the Blackwell Ultra GB300. The chip is now in mass production and shipping to select customers. Blackwell Ultra represents a significant upgrade in performance and features over the earlier Blackwell models.

[Image: NVIDIA Blackwell Ultra]

Much as NVIDIA’s Super series refreshed the original RTX gaming cards, the Ultra series builds on the company’s prior AI chips. Earlier generations such as Volta and Hopper never had Ultra variants, but their advances laid the groundwork for today’s designs. NVIDIA also continues to deliver substantial gains to non-Ultra models through software updates and optimization work.

[Image: NVIDIA Blackwell Ultra GPU diagram showing detailed architecture and connectivity specs.]

The Blackwell Ultra GB300 combines two reticle-sized dies, connected by NVIDIA’s high-bandwidth NV-HBI interface, that operate as a single unified GPU. Built on TSMC’s 4NP process technology (an optimized version of its 5nm-class node), the chip packs 208 billion transistors, with NV-HBI providing 10 TB/s of bandwidth between the two dies.

[Image: Diagram of the NVIDIA streaming multiprocessor architecture with CUDA and Tensor Cores.]

The GPU is equipped with 160 streaming multiprocessors (SMs), each featuring 128 CUDA cores and four 5th-generation Tensor Cores that support FP8, FP6, and NVFP4 precision. That works out to 20,480 CUDA cores and 640 Tensor Cores in total, alongside 40 MB of Tensor Memory (TMEM).
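As a quick sanity check, the headline core counts follow directly from the per-SM figures; a minimal Python sketch of the arithmetic:

```python
# Back-of-envelope check of the published Blackwell Ultra core counts.
# All figures come from the article text; the script only does the math.

SM_COUNT = 160            # streaming multiprocessors per GB300
CUDA_CORES_PER_SM = 128   # CUDA cores in each SM
TENSOR_CORES_PER_SM = 4   # 5th-gen Tensor Cores in each SM

total_cuda = SM_COUNT * CUDA_CORES_PER_SM      # 160 * 128
total_tensor = SM_COUNT * TENSOR_CORES_PER_SM  # 160 * 4

print(f"CUDA cores:   {total_cuda:,}")    # -> 20,480
print(f"Tensor cores: {total_tensor:,}")  # -> 640
```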

| Feature | Hopper | Blackwell | Blackwell Ultra |
| --- | --- | --- | --- |
| Manufacturing process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| Transistors | 80B | 208B | 208B |
| Dies per GPU | 1 | 2 | 2 |
| NVFP4 dense / sparse performance | — | 10 / 20 PetaFLOPS | 15 / 20 PetaFLOPS |
| FP8 dense / sparse performance | 2 / 4 PetaFLOPS | 5 / 10 PetaFLOPS | 5 / 10 PetaFLOPS |
| Attention acceleration (SFU EX2) | 4.5 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s |
| Max HBM capacity | 80 GB HBM (H100) / 141 GB HBM3E (H200) | 192 GB HBM3E | 288 GB HBM3E |
| Max HBM bandwidth | 3.35 TB/s (H100) / 4.8 TB/s (H200) | 8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| Max power (TGP) | Up to 700 W | Up to 1,200 W | Up to 1,400 W |

The innovations in the 5th-generation Tensor Cores are pivotal for AI computation. NVIDIA has advanced these cores steadily across generations (the common operation they all accelerate is sketched after the list):

  • NVIDIA Volta: Introduced 8-thread MMA units and support for FP16 calculations.
  • NVIDIA Ampere: Enhanced with full warp-wide MMA, BF16, and TensorFloat-32.
  • NVIDIA Hopper: Introduced Warp-group MMA across 128 threads and Transformer Engine with FP8 support.
  • NVIDIA Blackwell: Featured the 2nd-generation Transformer Engine with enhanced FP8 and FP6 compute capabilities.
  • NVIDIA Blackwell Ultra: Adds NVFP4 support, pairing near-FP8 accuracy with a much smaller memory footprint.
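For readers unfamiliar with the terminology above, every generation of Tensor Core accelerates the same basic primitive: a small tiled matrix multiply-accumulate, D = A × B + C, with low-precision inputs and a higher-precision accumulator. The NumPy sketch below illustrates that numerical pattern; the 16×16 tile shape and FP16 inputs are illustrative assumptions, not hardware fragment sizes:

```python
import numpy as np

# Toy illustration of a Tensor-Core-style MMA tile: D = A @ B + C.
# Real hardware fixes the fragment shapes per warp; the 16x16 tiles
# and float16 inputs here are assumptions chosen for readability.

M = N = K = 16
A = np.random.randn(M, K).astype(np.float16)  # low-precision input
B = np.random.randn(K, N).astype(np.float16)  # low-precision input
C = np.zeros((M, N), dtype=np.float32)        # high-precision accumulator

# Multiply low-precision inputs but accumulate in float32: this is the
# numerical trick that makes FP16/FP8/NVFP4 training and inference viable.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape, D.dtype)  # (16, 16) float32
```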
[Image: Comparison of GPU memory: Hopper H100 80 GB, Hopper H200 141 GB, Blackwell 192 GB, Blackwell Ultra 288 GB.]

The Blackwell Ultra chip significantly upgrades memory capacity, from a maximum of 192 GB on Blackwell GB200 models to 288 GB of HBM3E. That headroom is aimed at massive multi-trillion-parameter AI models. The memory system comprises eight HBM3E stacks with sixteen 512-bit controllers (an 8,192-bit aggregate interface) running at 8 TB/s, enabling the following (a rough capacity check follows the list):

  • Complete model accommodation: 300-billion-plus-parameter models can be held entirely in GPU memory without offloading.
  • Extended context lengths: Enhanced KV cache capacity for transformer applications.
  • Improved computational efficiency: Elevated compute-to-memory ratios for various workloads.
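To see how capacity and precision interact, here is a rough fit check of weight memory against the 288 GB package. The 300-billion-parameter figure comes from the list above; the bytes-per-parameter values are the standard sizes of each format (NVFP4 is roughly half a byte per weight, ignoring scale-factor overhead), so treat this as an estimate rather than a sizing guide:

```python
# Rough check: do a model's weights fit in 288 GB of HBM3E?
# Ignores KV cache, activations, and quantization scale overhead.

HBM_CAPACITY_GB = 288
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weights_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB for a model of the given size and format."""
    return params_billion * BYTES_PER_PARAM[fmt]  # 1e9 params * B/param / 1e9

for fmt in BYTES_PER_PARAM:
    size = weights_gb(300, fmt)
    verdict = "fits" if size <= HBM_CAPACITY_GB else "needs offload"
    print(f"300B params @ {fmt:>5}: {size:4.0f} GB -> {verdict}")
# FP16 -> 600 GB, FP8 -> 300 GB, NVFP4 -> 150 GB (only NVFP4 fits on one GPU)
```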
[Image: Bar chart comparing dense FP8 and NVFP4 GPU performance levels.]

The Blackwell architecture features robust interconnects, including NVLink, NVLink-C2C, and a PCIe Gen6 x16 interface, with the following specifications (the aggregate figures are multiplied out in a short sketch after the table below):

  • Per-GPU Bandwidth: 1.8 TB/s bidirectional (18 links x 100 GB/s).
  • Performance Improvement: A 2x increase over NVLink 4 (the Hopper generation).
  • Maximum Topology: Supports up to 576 GPUs in a non-blocking compute fabric.
  • Rack-Scale Integration: Enables configurations of 72 GPUs with 130 TB/s aggregate bandwidth.
  • PCIe Interface: Gen6 with 16 lanes providing 256 GB/s bidirectional throughput.
  • NVLink-C2C: Facilitates communication between CPU and GPU with memory coherency at 900 GB/s.
| Interconnect | Hopper GPU | Blackwell GPU | Blackwell Ultra GPU |
| --- | --- | --- | --- |
| NVLink (GPU-GPU) | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| NVLink-C2C (CPU-GPU) | 900 GB/s | 900 GB/s | 900 GB/s |
| PCIe interface | 128 GB/s (Gen 5) | 256 GB/s (Gen 6) | 256 GB/s (Gen 6) |
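The aggregate figures above are just the per-link numbers multiplied out; a quick sketch confirming them (the quoted 130 TB/s rack figure rounds 72 × 1.8 TB/s):

```python
# Reproducing the NVLink bandwidth totals quoted above.

LINKS_PER_GPU = 18    # NVLink 5 links per Blackwell Ultra GPU
GBPS_PER_LINK = 100   # bidirectional GB/s per link
GPUS_PER_RACK = 72    # NVL72 rack-scale configuration

per_gpu = LINKS_PER_GPU * GBPS_PER_LINK           # 1,800 GB/s = 1.8 TB/s
rack_total_tbps = per_gpu * GPUS_PER_RACK / 1000  # aggregate in TB/s

print(f"Per-GPU NVLink:   {per_gpu:,} GB/s")            # 1,800 GB/s
print(f"72-GPU aggregate: {rack_total_tbps:.1f} TB/s")  # 129.6 ~= 130 TB/s
```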

NVIDIA’s Blackwell Ultra GB300 achieves a remarkable 50% increase in dense low-precision compute through the new NVFP4 format, which delivers near-FP8 accuracy (discrepancies below 1%) while reducing memory requirements by up to 1.8x compared to FP8 and 3.5x compared to FP16.
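To make the NVFP4 idea concrete, below is a toy block-scaled 4-bit quantizer in the spirit of the format. Published descriptions of NVFP4 pair 4-bit E2M1 values with a per-16-element block scale (plus a tensor-level scale); the grid and block size here follow that scheme, but this is an illustrative sketch, not NVIDIA's implementation:

```python
import numpy as np

# Toy block-scaled 4-bit quantizer in the spirit of NVFP4.
# Assumptions: E2M1 value grid, 16-element blocks, float scale per block.

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1])  # signed representable values
BLOCK = 16

def quant_dequant(x: np.ndarray) -> np.ndarray:
    """Round each 16-element block to the scaled E2M1 grid and back."""
    out = np.empty_like(x)
    for i in range(0, len(x), BLOCK):
        blk = x[i:i + BLOCK]
        scale = np.abs(blk).max() / 6.0 or 1.0  # map block max onto grid max
        idx = np.abs(blk[:, None] / scale - GRID[None, :]).argmin(axis=1)
        out[i:i + BLOCK] = GRID[idx] * scale    # nearest grid value, rescaled
    return out

x = np.random.randn(64).astype(np.float32)
err = np.abs(quant_dequant(x) - x).mean()
print(f"mean absolute round-trip error: {err:.4f}")  # small but nonzero
```

Each weight costs 4 bits plus a share of one per-block scale, roughly 4.5 bits in total, which is where the approximately 1.8x saving over FP8 and 3.5x over FP16 comes from.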

[Image: Diagram of the Blackwell KV-cache attention mechanism with batched MatMul, softmax, and speedup indicators.]

The Blackwell Ultra also integrates sophisticated scheduling management alongside enterprise-level security features, including:

  • Enhanced GigaThread Engine: An advanced scheduler that optimizes workload distribution, enhancing context-switching performance across all 160 SMs.
  • Multi-Instance GPU (MIG): Ability to partition GPUs into various MIG instances, allowing tailored memory allocations for secure multi-tenancy.
  • Confidential Computing: Provisions for secure handling of sensitive AI models, leveraging hardware-based Trusted Execution Environment (TEE) and secure NVLink operations without significant performance loss.
  • Advanced NVIDIA RAS engine: An AI-driven Reliability, Availability, and Serviceability (RAS) system that improves uptime by predicting failures and optimizing maintenance.

Performance efficiency also improves significantly: the Blackwell Ultra GB300 delivers higher throughput per megawatt (TPS/MW) than the GB200, as the charts below illustrate:

[Image: Charts comparing inference throughput against per-user response speed across GPU architectures, illustrating how Blackwell and Blackwell Ultra shift the Pareto frontier of performance and user experience.]

In summary, NVIDIA continues to lead in AI technology, exemplified by the Blackwell and Blackwell Ultra architectures. Their commitment to enhancing software support and optimizations ensures a strong competitive edge, backed by ongoing research and development that promises to keep them at the forefront of the industry for years to come.
