
NVIDIA has unveiled its latest AI chip, the Blackwell Ultra GB300, delivering a 50% increase in dense low-precision compute over its predecessor, the GB200, along with 288 GB of HBM3E memory.
Introducing NVIDIA’s Blackwell Ultra “GB300”: A Revolutionary AI Chip
Recently, NVIDIA published a detailed article outlining the specifications and capabilities of the Blackwell Ultra GB300. The chip is now in mass production and shipping to select customers. The Blackwell Ultra represents a significant upgrade in performance and features over the previous Blackwell models.

Much as NVIDIA’s Super series improved on the original RTX gaming cards, the Ultra series enhances the company’s prior AI chip offerings. Earlier architectures such as Hopper and Volta had no Ultra variants, but their advances laid the groundwork for the current design. NVIDIA is also delivering substantial gains to non-Ultra models through software updates and optimization work.

The Blackwell Ultra GB300 is an advanced iteration that combines two reticle-sized dies, connected by NVIDIA’s high-bandwidth NV-HBI interface, into what operates as a single unified GPU. Built on TSMC’s 4NP process technology (an optimized version of its 5nm node), the chip houses 208 billion transistors, with NV-HBI providing 10 TB/s of bandwidth between the two dies.

The GPU is equipped with 160 streaming multiprocessors (SMs), each containing 128 CUDA cores and four 5th-generation Tensor Cores that support FP8, FP6, and NVFP4 precision. Across the chip, that adds up to 20,480 CUDA cores and 640 Tensor Cores, alongside 40 MB of Tensor Memory (TMEM), as the quick check below shows.
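The totals follow directly from the per-SM counts; a minimal Python sketch using only the figures quoted above:

```python
# Totals implied by the per-SM figures quoted above.
sms = 160
cuda_cores_per_sm = 128
tensor_cores_per_sm = 4

total_cuda_cores = sms * cuda_cores_per_sm      # 160 * 128 = 20,480
total_tensor_cores = sms * tensor_cores_per_sm  # 160 * 4   = 640

print(f"CUDA cores:   {total_cuda_cores:,}")
print(f"Tensor Cores: {total_tensor_cores:,}")
```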
Feature | Hopper | Blackwell | Blackwell Ultra |
---|---|---|---|
Manufacturing process | TSMC 4N | TSMC 4NP | TSMC 4NP |
Transistors | 80B | 208B | 208B |
Dies per GPU | 1 | 2 | 2 |
NVFP4 dense / sparse performance | – | 10 / 20 PetaFLOPS | 15 / 20 PetaFLOPS |
FP8 dense / sparse performance | 2 / 4 PetaFLOPS | 5 / 10 PetaFLOPS | 5 / 10 PetaFLOPS |
Attention acceleration (SFU EX2) | 4.5 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s |
Max HBM capacity | 80 GB HBM (H100) / 141 GB HBM3E (H200) | 192 GB HBM3E | 288 GB HBM3E |
Max HBM bandwidth | 3.35 TB/s (H100) / 4.8 TB/s (H200) | 8 TB/s | 8 TB/s |
NVLink bandwidth | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
Max power (TGP) | Up to 700 W | Up to 1,200 W | Up to 1,400 W |
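The paired dense / sparse figures reflect NVIDIA’s 2:4 structured-sparsity scheme: within every group of four consecutive weights, at most two may be non-zero, which lets the Tensor Cores skip the zeros and roughly double peak throughput. A minimal illustrative sketch of that constraint (not NVIDIA’s implementation; real pruning criteria are more sophisticated):

```python
import numpy as np

def satisfies_2_4_sparsity(weights: np.ndarray) -> bool:
    """Check that every group of 4 consecutive values has at most 2 non-zeros."""
    groups = weights.reshape(-1, 4)
    return bool((np.count_nonzero(groups, axis=1) <= 2).all())

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude values in each group of 4, zero the rest."""
    groups = weights.reshape(-1, 4).copy()
    # Indices of the 2 smallest-magnitude entries per group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(weights.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_2_4(w)
assert satisfies_2_4_sparsity(w_sparse)  # half the values are now zero
```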
The innovations in the 5th-generation Tensor Cores are pivotal for AI computation. NVIDIA has advanced these cores with each architecture:
- NVIDIA Volta: Introduced 8-thread MMA units and support for FP16 calculations.
- NVIDIA Ampere: Enhanced with full warp-wide MMA, BF16, and TensorFloat-32.
- NVIDIA Hopper: Introduced Warp-group MMA across 128 threads and Transformer Engine with FP8 support.
- NVIDIA Blackwell: Featured 2nd Gen Transformer Engine with enhanced FP8 and FP6 compute capabilities.

The Blackwell Ultra chip significantly upgrades memory capacity, increasing from a maximum of 192 GB of HBM3E in the Blackwell GB200 to 288 GB of HBM3E. This leap permits support for massive multi-trillion-parameter AI models. Its memory subsystem comprises eight HBM3E stacks on an 8,192-bit interface (sixteen 512-bit controllers) delivering 8 TB/s of bandwidth, enabling:
- Complete model accommodation: handles models with 300 billion+ parameters without memory offloading (see the footprint sketch after this list).
- Extended context lengths: Enhanced KV cache capacity for transformer applications.
- Improved computational efficiency: Elevated compute-to-memory ratios for various workloads.
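To put the 288 GB figure in perspective, here is a rough, weights-only footprint estimate. This is a back-of-the-envelope sketch: it ignores KV cache and activations, and the NVFP4 bytes-per-parameter figure is an assumed layout (a 4-bit value plus an FP8 scale shared across each 16-element block), not a confirmed specification.

```python
# Back-of-the-envelope, weights-only footprint vs. 288 GB of HBM3E.
HBM_GB = 288
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "NVFP4": 0.5 + 1.0 / 16}  # NVFP4 layout assumed

def weights_gb(params_billions: float, fmt: str) -> float:
    # 1e9 params * bytes/param / 1e9 bytes/GB = params_billions * bytes/param
    return params_billions * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    gb = weights_gb(300, fmt)  # a 300B-parameter model, weights only
    verdict = "fits" if gb <= HBM_GB else "exceeds 288 GB"
    print(f"{fmt:>5}: {gb:6.1f} GB of weights -> {verdict}")
```

Under these assumptions, only the NVFP4 version of a 300B-parameter model (about 169 GB of weights) fits comfortably on a single GPU, which is precisely where the low-precision format pays off.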

The Blackwell architecture features robust interconnects, including NVLink, NVLink-C2C, and a PCIe Gen6 x16 interface, with the following specifications (a quick consistency check follows the list):
- Per-GPU Bandwidth: 1.8 TB/s bidirectional (18 links x 100 GB/s).
- Performance Improvement: a 2x increase over NVLink 4 (as used with Hopper).
- Maximum Topology: Supports up to 576 GPUs in a non-blocking compute fabric.
- Rack-Scale Integration: Enables configurations of 72 GPUs with 130 TB/s aggregate bandwidth.
- PCIe Interface: Gen6 with 16 lanes providing 256 GB/s bidirectional throughput.
- NVLink-C2C: Facilitates communication between CPU and GPU with memory coherency at 900 GB/s.
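These figures are mutually consistent, as a short check using only the numbers from the list shows:

```python
# Consistency check on the NVLink figures quoted above.
links_per_gpu = 18
gb_per_link = 100                       # GB/s, bidirectional, per link

per_gpu = links_per_gpu * gb_per_link   # 1,800 GB/s = 1.8 TB/s per GPU
rack = 72 * per_gpu / 1000              # 72-GPU rack: 129.6 TB/s, quoted as ~130 TB/s

print(f"Per GPU: {per_gpu} GB/s")
print(f"72-GPU rack aggregate: {rack} TB/s")
```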
Interconnect | Hopper GPU | Blackwell GPU | Blackwell Ultra GPU |
---|---|---|---|
NVLink (GPU-GPU) | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
NVLink-C2C (CPU-GPU) | 900 GB/s | 900 GB/s | 900 GB/s |
PCIe interface | 128 GB/s (Gen 5) | 256 GB/s (Gen 6) | 256 GB/s (Gen 6) |
NVIDIA’s Blackwell Ultra GB300 achieves its 50% increase in dense low-precision compute through the new NVFP4 format, which offers near-FP8 accuracy with minimal degradation (less than 1%). NVFP4 also reduces memory requirements by up to 1.8x compared to FP8 and 3.5x compared to FP16.
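The quoted savings line up with the formats’ per-parameter storage costs. Assuming the same NVFP4 block-scaling layout as in the footprint sketch above (an assumption consistent with, but not confirmed by, the quoted figures), the ratios work out as follows:

```python
# Storage cost per parameter; the NVFP4 layout (4-bit values plus an FP8
# scale shared by each 16-element block) is an assumption.
fp16, fp8 = 2.0, 1.0
nvfp4 = 0.5 + 1.0 / 16                        # = 0.5625 bytes/parameter

print(f"FP8  vs NVFP4: {fp8 / nvfp4:.2f}x")   # ~1.78x -> the "up to 1.8x" claim
print(f"FP16 vs NVFP4: {fp16 / nvfp4:.2f}x")  # ~3.56x -> the "3.5x" claim
```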

The Blackwell Ultra also integrates sophisticated scheduling management alongside enterprise-level security features, including:
- Enhanced GigaThread Engine: An advanced scheduler that optimizes workload distribution, enhancing context-switching performance across all 160 SMs.
- Multi-Instance GPU (MIG): Ability to partition GPUs into various MIG instances, allowing tailored memory allocations for secure multi-tenancy.
- Confidential Computing: Provisions for secure handling of sensitive AI models, leveraging hardware-based Trusted Execution Environment (TEE) and secure NVLink operations without significant performance loss.
- Advanced RAS Engine: NVIDIA’s Reliability, Availability, and Serviceability (RAS) engine, an AI-driven monitoring system that enhances uptime by predicting failures and optimizing maintenance.
Performance efficiency also improves significantly with the Blackwell Ultra GB300, which delivers superior TPS/MW (tokens per second per megawatt) compared to the GB200.

[Charts: TPS/MW comparison, GB300 vs. GB200]
In summary, NVIDIA continues to lead in AI technology, exemplified by the Blackwell and Blackwell Ultra architectures. Their commitment to enhancing software support and optimizations ensures a strong competitive edge, backed by ongoing research and development that promises to keep them at the forefront of the industry for years to come.