Microsoft Azure’s Ultra Upgrade Featuring NVIDIA GB300 “Blackwell Ultra” GPUs: 4600 GPUs Powering AI Models with Over a Trillion Parameters

Microsoft has made a significant announcement regarding its Azure platform, unveiling its first large-scale production cluster built around NVIDIA’s cutting-edge GB300 “Blackwell Ultra” GPUs. The cluster is designed specifically for handling extremely large AI models.

NVIDIA GB300 “Blackwell Ultra”: Enhancing AI Training on Microsoft’s Azure Platform

Azure’s infrastructure has been upgraded with Blackwell Ultra through a deployment of more than 4,600 GPUs built on NVIDIA’s GB300 NVL72 rack-scale architecture. The setup uses state-of-the-art InfiniBand interconnect technology and significantly boosts Microsoft’s ability to roll out hundreds of thousands of Blackwell Ultra GPUs across its global data centers, all dedicated to AI workloads.

According to Microsoft, the new Azure cluster equipped with NVIDIA GB300 NVL72 “Blackwell Ultra” GPUs can dramatically cut model training times from several months to mere weeks, making it practical to train models with hundreds of trillions of parameters. NVIDIA has also demonstrated leading inference performance, as evidenced by numerous MLPerf benchmarks and the recent InferenceMAX AI tests.
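
For intuition on the months-to-weeks claim, a rough estimate can be made with the widely used 6 × parameters × tokens rule of thumb for dense-transformer training FLOPs. Every number below (model size, token count, sustained throughput) is an illustrative assumption rather than a figure from Microsoft or NVIDIA:

```python
# Back-of-the-envelope training-time estimate -- illustrative assumptions only.
# Rule of thumb for dense transformers: total training FLOPs ≈ 6 * params * tokens.

params = 1e12            # hypothetical 1-trillion-parameter model
tokens = 10e12           # hypothetical 10-trillion-token training run
total_flops = 6 * params * tokens            # ≈ 6e25 FLOPs

# Hypothetical sustained cluster throughput, deliberately far below the
# 4,600-GPU FP4 peak, since training runs in higher precision and below
# peak utilization.
sustained_flops = 1e19                       # 10 exaFLOPS sustained (assumption)

seconds = total_flops / sustained_flops
print(f"~{seconds / 86_400:.0f} days (~{seconds / (7 * 86_400):.1f} weeks)")
# -> ~69 days (~9.9 weeks); a cluster several times smaller would push the
#    same run out to many months, which is the scaling the article describes.
```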

The newly launched Azure ND GB300 v6 virtual machines (VMs) are optimized for a range of advanced workloads, including reasoning models, agentic AI systems, and multimodal generative AI. Each rack in this infrastructure houses 18 VMs and a total of 72 GPUs, with the following key specifications (a quick per-GPU breakdown follows the list):

  • 72 NVIDIA Blackwell Ultra GPUs paired with 36 NVIDIA Grace CPUs.
  • 800 gigabits per second (Gbps) per GPU of cross-rack scale-out bandwidth via cutting-edge NVIDIA Quantum-X800 InfiniBand.
  • 130 terabytes (TB) per second of NVIDIA NVLink bandwidth per rack.
  • 37 TB of high-speed memory.
  • Up to 1,440 petaflops (PFLOPS) of FP4 Tensor Core performance.
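
As noted above, these rack-level numbers reduce to per-GPU figures with simple division; the short sketch below uses only the specifications listed here:

```python
# Per-GPU breakdown of the published ND GB300 v6 rack-level specs.
GPUS_PER_RACK = 72
VMS_PER_RACK = 18
RACK_FP4_PFLOPS = 1_440        # FP4 Tensor Core peak per rack
RACK_NVLINK_TB_S = 130         # intra-rack NVLink bandwidth
RACK_FAST_MEMORY_TB = 37       # fast memory per rack (from the list above)

print(f"GPUs per VM:                  {GPUS_PER_RACK // VMS_PER_RACK}")                      # 4
print(f"FP4 peak per GPU:             {RACK_FP4_PFLOPS / GPUS_PER_RACK:.0f} PFLOPS")         # 20
print(f"Fast memory per GPU (pooled): ~{RACK_FAST_MEMORY_TB * 1000 / GPUS_PER_RACK:.0f} GB") # ~514
print(f"NVLink share per GPU:         ~{RACK_NVLINK_TB_S * 1000 / GPUS_PER_RACK:.0f} GB/s")  # ~1,800
```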

At the rack level, NVLink and NVSwitch tightly couple GPU memory and bandwidth, enabling an astounding 130 TB per second of intra-rack data transfer across 37 TB of connected fast memory. This architecture effectively turns each rack into a single integrated unit, delivering higher inference throughput and lower latency for larger models and longer context windows, which makes agentic and multimodal AI systems more responsive and scalable than ever before.
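
To put 130 TB per second into perspective, the sketch below converts it into idealized data-movement times inside one rack; the 2 TB weight set is a hypothetical example, and real transfers would see additional overhead:

```python
# Idealized intra-rack transfer times at the quoted NVLink bandwidth.
NVLINK_TB_PER_S = 130          # aggregate NVLink bandwidth per rack (from the article)
POOLED_MEMORY_TB = 37          # fast memory connected per rack (from the article)

def transfer_time_s(data_tb: float, bandwidth_tb_s: float = NVLINK_TB_PER_S) -> float:
    """Best-case transfer time, ignoring protocol overhead and contention."""
    return data_tb / bandwidth_tb_s

# Sweeping the entire 37 TB memory pool once:
print(f"Full 37 TB pool: ~{transfer_time_s(POOLED_MEMORY_TB) * 1000:.0f} ms")   # ~285 ms

# A hypothetical 2 TB set of weights (roughly a 1T-parameter model at 16-bit):
print(f"2 TB of weights: ~{transfer_time_s(2) * 1000:.0f} ms")                  # ~15 ms
```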

To extend capabilities beyond individual racks, Azure employs a high-performance fat-tree network built on NVIDIA Quantum-X800 InfiniBand (800 Gbps). This design lets ultra-large model training scale efficiently to tens of thousands of GPUs while minimizing communication overhead. The reduced synchronization overhead also keeps GPU utilization high, enabling faster research cycles and better cost efficiency despite the intensive computational demands of AI training. Azure’s co-engineered stack, which includes custom protocols and in-network computing capabilities, ensures high reliability and effective resource utilization. Technologies such as NVIDIA SHARP accelerate collective operations and double effective bandwidth by performing reductions on the switch itself, making large-scale training and inference more efficient.
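
The claim that on-switch reduction roughly doubles effective bandwidth can be illustrated with a simple traffic count: in a conventional ring all-reduce, each GPU sends and receives roughly twice the payload size, whereas with SHARP-style in-network reduction each GPU sends its contribution up to the switch once and receives the reduced result once. The sketch below is a generic model of that effect, not a description of NVIDIA's implementation:

```python
# Per-GPU network traffic for an all-reduce of s_bytes across p GPUs.

def ring_allreduce_bytes(s_bytes: float, p: int) -> float:
    """Classic ring all-reduce (reduce-scatter + all-gather), per direction."""
    return 2 * (p - 1) / p * s_bytes           # -> ~2*S for large p

def in_network_reduce_bytes(s_bytes: float) -> float:
    """SHARP-style: one copy up to the switch, one reduced copy back down."""
    return s_bytes                              # ~1*S per direction

S = 10e9   # hypothetical 10 GB gradient buffer
P = 72     # GPUs participating (one NVL72 rack, for example)

ring = ring_allreduce_bytes(S, P)
sharp = in_network_reduce_bytes(S)
print(f"Ring all-reduce:      ~{ring / 1e9:.1f} GB per GPU, each direction")
print(f"In-network reduction: ~{sharp / 1e9:.1f} GB per GPU, each direction")
print(f"Traffic ratio:        ~{ring / sharp:.2f}x -> roughly double the effective bandwidth")
```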

Furthermore, Azure’s cooling approach combines standalone heat-exchanger units with advanced facility cooling systems to reduce water consumption while keeping dense, high-performance GB300 NVL72 racks thermally stable. Microsoft also continues to evolve its power-distribution models to handle the high energy density and dynamic load-balancing demands of the ND GB300 v6 VM class of GPU clusters.

via Microsoft

As highlighted by NVIDIA, this collaboration between Microsoft Azure and NVIDIA marks a pivotal moment in reinforcing the United States’ lead in the AI sector. Customers can now access and leverage these groundbreaking Azure VMs for their own projects.

Source & Images
