Google challenges Nvidia Blackwell GPUs with latest Trillium TPUs

Google’s Trillium TPUs: A New Era in AI Acceleration

A decade ago, Google embarked on its journey to create custom AI accelerators known as Tensor Processing Units (TPUs). In May 2024, the company unveiled its sixth-generation TPU, named Trillium, which sets new benchmarks in performance and efficiency over previous generations. Today, Google announced that Trillium TPUs are generally available to Google Cloud customers, and revealed that these TPUs were used to train its latest model, Gemini 2.0.

Breaking Into the AI Developer Ecosystem

Nvidia’s GPUs have become the dominant choice among AI developers, not only because of their exceptional hardware but also because of robust software support. To build similar momentum behind Trillium TPUs, Google has made significant enhancements to its software stack. These include optimizations to the XLA compiler and to popular AI frameworks such as JAX, PyTorch, and TensorFlow, helping developers maximize cost-effectiveness across AI training, tuning, and serving.
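To make the software angle concrete, here is a minimal sketch (not Google's code) of the developer workflow the article describes: writing ordinary JAX, which the XLA compiler then lowers to whatever accelerator backs the runtime. The function and shapes below are illustrative assumptions; the same code runs unchanged on CPU, GPU, or a TPU VM.

```python
# Illustrative sketch: JAX code is traced and compiled by XLA, the same
# compiler Google says it has optimized for Trillium. Nothing here is
# TPU-specific; the backend is picked up from the runtime.
import jax
import jax.numpy as jnp

@jax.jit  # compile with XLA for the local backend (cpu/gpu/tpu)
def attention_scores(q, k):
    # Scaled dot-product scores: a typical workload XLA fuses into few kernels
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))

q = jnp.ones((4, 8))
k = jnp.ones((4, 8))
scores = attention_scores(q, k)
print(scores.shape)                # (4, 4)
print(jax.devices()[0].platform)   # 'tpu' on a Trillium VM, 'cpu' elsewhere
```

Because each row of the softmax sums to 1, the four rows of `scores` sum to 4.0 regardless of backend, which makes the snippet easy to sanity-check anywhere.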

Key Improvements in Trillium TPUs

Trillium TPUs offer a range of substantial improvements over the previous generation, including:

  • Training performance increased by more than 4x
  • Inference throughput enhanced by up to 3x
  • Energy efficiency boosted by 67%
  • Peak compute performance per chip elevated by an impressive 4.7x
  • High Bandwidth Memory (HBM) capacity doubled
  • Interchip Interconnect (ICI) bandwidth also doubled
  • Capability to deploy 100,000 Trillium chips in a unified Jupiter network fabric
  • Training performance per dollar improved by up to 2.5x and inference performance by up to 1.4x

Scalability and Availability

Google reports that Trillium TPUs achieve an impressive 99% scaling efficiency across 12 pods (3,072 chips) and 94% efficiency across 24 pods (6,144 chips), making the platform suitable for pre-training large models such as the 175-billion-parameter GPT-3.
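A quick back-of-the-envelope sketch shows what those scaling-efficiency figures mean in practice: multiplying chip count by efficiency gives the number of perfectly utilized chips the cluster effectively behaves like. The helper below is purely illustrative, using only the numbers reported in the article.

```python
# Sketch: translate reported scaling efficiency into "effective chips",
# i.e. the equivalent number of perfectly-utilized chips.
def effective_chips(chips: int, efficiency: float) -> float:
    return chips * efficiency

# Reported figures: 12 pods of 256 chips at 99%, 24 pods at 94%.
print(effective_chips(3_072, 0.99))  # ~3,041 chips' worth of throughput
print(effective_chips(6_144, 0.94))  # ~5,775 chips' worth of throughput
```

In other words, doubling the cluster from 3,072 to 6,144 chips still delivers roughly 1.9x the throughput rather than 2x, which is an unusually small penalty at this scale.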

Currently, Trillium is available for deployment in key regions including North America (US East), Europe (West), and Asia (Northeast). For users interested in evaluation, on-demand pricing starts at $2.70 per chip-hour. Longer-term commitments offer reduced rates of $1.89 per chip-hour for one year and $1.22 per chip-hour for three years.
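The listed chip-hour rates make cost estimates straightforward: multiply chips by hours by the rate for the chosen tier. The job size below (one 256-chip pod for 24 hours) is a hypothetical example, not a figure from the article.

```python
# Cost sketch using the listed per-chip-hour rates (USD).
# The job size is a hypothetical example for illustration only.
RATES = {"on_demand": 2.70, "1_year": 1.89, "3_year": 1.22}

def job_cost(chips: int, hours: float, rate: float) -> float:
    return chips * hours * rate

# Hypothetical job: one 256-chip pod for 24 hours.
for tier, rate in RATES.items():
    print(f"{tier}: ${job_cost(256, 24, rate):,.2f}")
```

At these rates, the three-year commitment cuts the bill for the same hypothetical job by more than half relative to on-demand pricing.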

Conclusion

With its scalability and enhanced software capabilities, Trillium signifies a substantial advancement in Google’s cloud AI infrastructure strategy, positioning it as a formidable competitor in the evolving market of AI accelerators.
