CoreWeave Benchmark: NVIDIA GB300 NVL72 Delivers 6x the Per-GPU Throughput of the H100 on DeepSeek R1

NVIDIA's newly launched Blackwell AI superchip, the GB300, significantly outperforms its predecessor, the H100 GPU, by running at a much lower tensor-parallel degree, which translates into far higher per-GPU throughput.

NVIDIA GB300: More Memory and Bandwidth Deliver Superior Throughput Over the H100

The introduction of NVIDIA's Blackwell-powered AI superchips marks a pivotal advance in GPU technology. The GB300 is NVIDIA's most advanced product to date, with remarkable gains in compute alongside increased memory capacity and bandwidth, enhancements that are critical for demanding AI workloads. A recent benchmark conducted by CoreWeave illustrates the GB300's potential: it achieves markedly higher throughput by running at a reduced tensor-parallel degree.

In the tests CoreWeave ran with the DeepSeek R1 reasoning model, a notable distinction emerged between the two platforms. Running the model required a cluster of 16 NVIDIA H100 GPUs, whereas just four GB300 GPUs on the NVIDIA GB300 NVL72 infrastructure sufficed for the same task. Remarkably, the GB300 system delivers roughly six times the raw throughput per GPU, underscoring its superiority on complex AI workloads compared to the H100.

[Benchmark chart: four GB300 GPUs outpace sixteen H100 GPUs, with roughly 6.5x the tokens/s per GPU. Image credit: CoreWeave]
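
As a quick sanity check on the headline figure, the snippet below simply divides aggregate throughput by GPU count. The token rates are illustrative placeholders rather than CoreWeave's raw measurements, chosen so the ratio lands near the reported ~6.5x.

```python
# Back-of-the-envelope per-GPU throughput comparison.
# Aggregate token rates are illustrative placeholders, not CoreWeave's data.

def per_gpu_throughput(aggregate_tokens_per_s: float, num_gpus: int) -> float:
    """Aggregate cluster throughput divided across the GPUs serving the model."""
    return aggregate_tokens_per_s / num_gpus

h100_aggregate = 1_000.0   # tokens/s across 16 H100s (placeholder)
gb300_aggregate = 1_625.0  # tokens/s across 4 GB300s (placeholder)

h100_per_gpu = per_gpu_throughput(h100_aggregate, 16)   # 62.5 tokens/s per GPU
gb300_per_gpu = per_gpu_throughput(gb300_aggregate, 4)  # 406.25 tokens/s per GPU

print(f"Per-GPU speedup: {gb300_per_gpu / h100_per_gpu:.1f}x")  # -> 6.5x
```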

The findings demonstrate a significant advantage for the GB300, which runs a simpler 4-way tensor-parallel configuration. The lower parallel degree reduces inter-GPU communication overhead, while the larger memory capacity and bandwidth contribute substantial performance gains. The GB300 NVL72 platform benefits from high-bandwidth NVLink and NVSwitch interconnects, enabling rapid data exchange between GPUs.
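
To make the communication argument concrete, here is a minimal cost sketch of a ring all-reduce across a t-way tensor-parallel group: per-GPU traffic scales with 2(t-1)/t of the message size, and the number of synchronization steps with 2(t-1). The message size, link bandwidths, and step latency below are assumptions for illustration, not measured NVLink figures.

```python
# Rough cost model for a ring all-reduce across a t-way tensor-parallel group.
# Bandwidth and latency values are illustrative assumptions, not measurements.

def ring_allreduce_us(message_bytes: float, t: int,
                      link_gb_per_s: float, step_latency_us: float) -> float:
    """Estimated all-reduce time: transfer volume plus per-step latency."""
    volume = 2 * (t - 1) / t * message_bytes            # bytes moved per GPU
    transfer_us = volume / (link_gb_per_s * 1e9) * 1e6  # time on the wire
    return transfer_us + 2 * (t - 1) * step_latency_us  # 2(t-1) sync steps

msg = 64e6  # 64 MB activation tensor (assumed size)
# Assumed per-GPU NVLink bandwidths: ~900 GB/s (H100-class), ~1800 GB/s (Blackwell-class).
for t, bw in ((16, 900.0), (4, 1800.0)):
    print(f"TP={t:>2}: ~{ring_allreduce_us(msg, t, bw, 5.0):.0f} us per all-reduce")
```

Under these assumptions, the 4-way group finishes each all-reduce several times faster: it moves fewer bytes per GPU, pays for far fewer synchronization steps, and enjoys a faster link.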

This technological advancement translates into tangible benefits for users: quicker token generation and lower latency, allowing AI operations to scale more effectively in enterprise environments. CoreWeave highlights the specifications of the NVIDIA GB300 NVL72 rack-scale system, which offers a staggering 37 TB of memory capacity (with support for up to 40 TB), ideally suited to large, complex AI models, along with interconnects that deliver up to 130 TB/s of bandwidth.
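
The memory headroom is what makes the smaller group feasible in the first place. The sketch below finds the smallest power-of-two tensor-parallel degree whose combined usable memory fits a model's working set; the 950 GB working set (roughly FP8 weights for a ~671B-parameter model plus KV cache) and the 15% runtime overhead are assumptions for illustration.

```python
# Smallest power-of-two tensor-parallel group whose combined usable memory
# holds the model's weights plus KV cache. All figures are rough assumptions.

def min_tp_degree(required_gb: float, per_gpu_gb: float,
                  overhead_frac: float = 0.15) -> int:
    """Smallest power-of-two group size that fits `required_gb`."""
    usable = per_gpu_gb * (1 - overhead_frac)  # reserve headroom for the runtime
    t = 1
    while t * usable < required_gb:
        t *= 2
    return t

required = 950.0  # GB: ~671 GB FP8 weights + KV cache + buffers (assumed)
print("H100 (80 GB HBM):   TP =", min_tp_degree(required, 80))   # -> 16
print("GB300 (288 GB HBM): TP =", min_tp_degree(required, 288))  # -> 4
```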

[Table: NVIDIA GB300 NVL72 specifications, covering GPU and CPU counts, memory bandwidth, and tensor core performance]

Ultimately, the NVIDIA GB300 goes beyond delivering impressive TFLOPs; it emphasizes operational efficiency. By minimizing tensor parallelism, the GB300 cuts the inter-GPU communication overhead that typically hampers large-scale AI training and inference. As a result, businesses can achieve significantly higher throughput with fewer GPUs, reducing cost and improving the scalability of their AI deployments.

News Source: CoreWeave
