
Meta has unveiled significant details about its Catalina AI system, which builds on NVIDIA’s GB200 NVL72 platform, alongside advances in Open Rack v3 and liquid cooling.
Revolutionizing Data Centers: Meta’s Custom NVIDIA GB200 NVL72 Blackwell Platform for the Catalina Pod
In 2022, Meta’s GPU clusters typically comprised around 6,000 GPUs and were aimed mainly at traditional ranking and recommendation models, with individual jobs usually spanning 128 to 512 GPUs. Over the past year, however, a remarkable transformation has taken place, driven by the explosive rise of generative AI (GenAI) and large language models (LLMs).

Fast forward to today, and Meta’s GPU clusters have grown to an impressive 16,000 to 24,000 GPUs, a fourfold increase. As of last year, the company operates more than 100,000 GPUs and continues to scale up. With software advances such as its Llama models, Meta forecasts a further tenfold increase in cluster sizes in the near future.

Meta initiated the Catalina project in close collaboration with NVIDIA, using the NVL72 GPU solution as a foundational element. Alterations were made to tailor the system to Meta’s specific requirements, and both companies contributed their MGX and NVL72 reference designs as open designs, making them broadly accessible on the Open Compute Project website.

The Catalina system represents Meta’s latest deployments across its data centers, where each system configuration is termed a “pod.” This modular design enables rapid scaling by duplicating the basic framework.


A distinguishing feature of Meta’s custom NVL72 design is its dual IT racks, which together form a single scale-up domain of 72 GPUs. Both racks share the same configuration, housing 18 compute trays across the upper and lower sections, along with nine NVSwitches on each side. Redundant cabling unifies the GPU resources across both racks, effectively establishing a single compute domain.
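
To make the topology concrete, here is a minimal sketch (not Meta’s code) that tallies a Catalina-style pod. The tray and switch counts come from the description above, while the per-tray split of two CPUs and two GPUs is an illustrative assumption consistent with the doubled CPU count mentioned later.

```python
# Illustrative sketch of a Catalina-style dual-rack pod (not Meta's code).
# Assumption: each compute tray carries 2 Grace CPUs and 2 Blackwell GPUs.
from dataclasses import dataclass

@dataclass
class Rack:
    compute_trays: int = 18   # trays sit in the upper and lower sections
    nvswitches: int = 9       # NVSwitch trays per rack
    gpus_per_tray: int = 2    # assumed split across the two racks
    cpus_per_tray: int = 2    # assumed; matches the doubled CPU count

    @property
    def gpus(self) -> int:
        return self.compute_trays * self.gpus_per_tray

    @property
    def cpus(self) -> int:
        return self.compute_trays * self.cpus_per_tray

pod = [Rack(), Rack()]  # two IT racks form one scale-up domain
print("GPUs in scale-up domain:", sum(r.gpus for r in pod))   # 72
print("CPUs in scale-up domain:", sum(r.cpus for r in pod))   # 72
print("NVSwitch trays:", sum(r.nvswitches for r in pod))      # 18
```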

Each rack also accommodates large air-assisted liquid cooling (ALC) units that support high power densities. This configuration enables Meta to deploy liquid cooling efficiently in data centers across North America and worldwide.

With these dual racks, Meta effectively doubles the CPU count and maximizes memory capacity, allowing for up to 34 TB of LPDDR memory and a combined total of roughly 48 TB of cache-coherent memory accessible by both GPUs and CPUs. Power supply units (PSUs) accept either 480-volt or 277-volt single-phase input and convert it to 48 V DC, which powers all server blades, networking devices, and NVSwitches within the architecture.
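
The memory figure can be sanity-checked with a back-of-the-envelope calculation. The per-device capacities below (roughly 480 GB of LPDDR per Grace CPU and roughly 192 GB of HBM per Blackwell GPU) are typical published GB200 figures rather than values confirmed in Meta’s talk, so treat this as an illustrative estimate:

```python
# Back-of-the-envelope check of the pod's memory pool (illustrative assumptions).
GRACE_CPUS = 72            # doubled CPU count across the two racks
BLACKWELL_GPUS = 72
LPDDR_PER_CPU_GB = 480     # assumed typical Grace LPDDR5X capacity
HBM_PER_GPU_GB = 192       # assumed typical Blackwell HBM3e capacity

lpddr_tb = GRACE_CPUS * LPDDR_PER_CPU_GB / 1000      # ~34.6 TB of LPDDR
hbm_tb = BLACKWELL_GPUS * HBM_PER_GPU_GB / 1000      # ~13.8 TB of HBM
print(f"LPDDR total: {lpddr_tb:.1f} TB")
print(f"HBM total:   {hbm_tb:.1f} TB")
print(f"Cache-coherent pool: {lpddr_tb + hbm_tb:.1f} TB")  # ~48 TB
```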





Additionally, the configuration features a power supply shelf at both the top and bottom of each rack, complemented by additional units at the base. Meta has also implemented a dedicated fiber patch panel that manages all internal fiber cabling to the backend network, ensuring clean connectivity to the end-point switches that serve the scale-up domain.

Supporting this infrastructure, Meta integrates the technologies native to the NVIDIA GB200 NVL72 Blackwell system along with its own enhancements, such as high-capacity power supplies and blades. Liquid cooling, coordinated by the rack management controller (RMC), keeps the cooling loop efficiently managed while simultaneously monitoring for leaks.
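
The RMC’s role can be pictured as a small supervisory loop. The sketch below is purely hypothetical (the sensor names, thresholds, and actions are invented for illustration and are not taken from Meta’s firmware): it polls leak sensors and coolant telemetry and escalates if anything trips.

```python
# Hypothetical rack-management-controller loop (illustrative only; not Meta's RMC).
import random
import time

def read_leak_sensors() -> list[bool]:
    """Stand-in for polling spot leak sensors placed around the coolant manifolds."""
    return [random.random() < 0.001 for _ in range(8)]

def read_coolant_temp_c() -> float:
    """Stand-in for supply-side coolant temperature telemetry."""
    return 30.0 + random.uniform(-2.0, 6.0)

def rmc_tick(max_coolant_c: float = 40.0) -> str:
    """One supervisory pass: check for leaks first, then thermal headroom."""
    if any(read_leak_sensors()):
        return "ALERT: leak detected - isolate loop and notify facility"
    if read_coolant_temp_c() > max_coolant_c:
        return "WARN: coolant temperature high - increase flow / fan assist"
    return "OK"

if __name__ == "__main__":
    for _ in range(5):
        print(rmc_tick())
        time.sleep(0.1)
```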






This deployment also marks Meta’s high-power Open Rack v3, which raises per-rack power delivery to a substantial 94 kW at 600 A and is designed for facilities with integrated liquid cooling. The RMC manages the liquid flow, monitoring components throughout the rack for potential leaks while orchestrating the operation of the cooling systems.
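
As a rough illustration of what a 94 kW envelope means at the rack level, the sketch below sums assumed per-component draws against that cap. The per-tray wattages are placeholders chosen for illustration, not figures from Meta’s presentation:

```python
# Illustrative rack power-budget check against the 94 kW Open Rack v3 envelope.
RACK_BUDGET_W = 94_000

# Placeholder per-component draws (assumed for illustration only).
loads_w = {
    "compute_trays": 18 * 4_000,   # assumed ~4 kW per GB200 compute tray
    "nvswitch_trays": 9 * 800,     # assumed ~0.8 kW per NVSwitch tray
    "networking_and_misc": 3_000,  # assumed overhead for NICs, fans, RMC, etc.
}

total_w = sum(loads_w.values())
headroom_w = RACK_BUDGET_W - total_w
print(f"Estimated draw: {total_w/1000:.1f} kW of {RACK_BUDGET_W/1000:.0f} kW budget")
print(f"Headroom: {headroom_w/1000:.1f} kW ({'OK' if headroom_w >= 0 else 'OVER BUDGET'})")
```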

Moreover, Meta’s adoption of a disaggregated scheduled fabric enables multiple pods to be interconnected within a single data center and, at larger scale, across multiple buildings. This infrastructure is tailored for AI workloads, enhancing inter-GPU communication and overall system flexibility.
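
Conceptually, the fabric lets pods act as repeatable building blocks that attach to a shared interconnect layer. The toy model below illustrates that idea only; the class names, pod counts, and building labels are invented and do not reflect Meta’s actual fabric design:

```python
# Toy model of pods attaching to a fabric spanning buildings (illustrative only).
from collections import defaultdict

class Fabric:
    def __init__(self) -> None:
        self.pods_by_building: dict[str, list[str]] = defaultdict(list)

    def attach(self, building: str, pod_id: str) -> None:
        """Register a pod (a dual-rack, 72-GPU scale-up domain) with the fabric."""
        self.pods_by_building[building].append(pod_id)

    def total_gpus(self, gpus_per_pod: int = 72) -> int:
        """GPUs reachable over the fabric, assuming identical pods."""
        return sum(len(p) for p in self.pods_by_building.values()) * gpus_per_pod

fabric = Fabric()
for building in ("building-A", "building-B"):
    for i in range(4):                      # assume four pods per building
        fabric.attach(building, f"{building}-pod-{i}")

print("Pods per building:", {b: len(p) for b, p in fabric.pods_by_building.items()})
print("GPUs reachable over the fabric:", fabric.total_gpus())  # 8 pods x 72 = 576
```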