NVIDIA GeForce RTX 5090 & RTX PRO 6000 GPUs Hit by Virtualization Bug, Full System Reboot Needed for Recovery

NVIDIA’s flagship graphics processing units, the GeForce RTX 5090 and the RTX PRO 6000, reportedly suffer from a bug that leaves them unresponsive when used in virtualized environments.

Critical Virtualization Issues Found in NVIDIA’s Blackwell GPUs

CloudRift, a GPU cloud service for developers, was the first to highlight the instability affecting NVIDIA’s high-performance graphics cards. The company observed that after just a few days of use in virtual machine (VM) environments, these GPUs become completely unresponsive. Once the problem occurs, access to the affected GPUs is restored only by rebooting the entire node. So far the issue appears to be limited to the RTX 5090 and RTX PRO 6000; other GPUs, such as the RTX 4090, Hopper H100s, and the Blackwell-based B200s, are unaffected for the time being.

The problem occurs when the GPU is assigned to a VM through the VFIO device driver. Following a Function Level Reset (FLR), the GPU fails to respond, causing a kernel soft lockup that effectively halts operations on both the host and the guest systems. Clearing the deadlock requires rebooting the host machine, which creates significant complications for CloudRift because of the large number of guest machines it manages.

Error messages related to RTX 5090 and RTX PRO 6000 during VM operations.
Image Credits: CloudRift
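
For operators who want to check a passed-through GPU after a VM shuts down, the failure mode described above suggests a simple host-side probe. The following is a minimal Python sketch, not something taken from CloudRift's report. It assumes a Linux host that exposes the standard sysfs PCI entries and that the host itself is still responsive. It reports which driver is bound to the device and reads the vendor ID from PCI configuration space; an all-ones vendor ID (0xFFFF) is the conventional sign that a PCI device is no longer answering config reads. Whether the affected cards present exactly this way is an assumption, and the function names and the device address used below are hypothetical.

```python
import os
from pathlib import Path

# Standard sysfs location for PCI devices on Linux.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def bound_driver(bdf, root=SYSFS_PCI):
    """Return the name of the driver bound to the device, or None.

    `bdf` is the PCI address, e.g. "0000:01:00.0" (hypothetical here).
    A GPU assigned to a VM is normally bound to "vfio-pci".
    """
    link = Path(root) / bdf / "driver"
    if not link.is_symlink():
        return None
    return os.path.basename(os.readlink(link))


def vendor_id_from_config(bdf, root=SYSFS_PCI):
    """Read the 16-bit vendor ID from the first two bytes of config space."""
    with open(Path(root) / bdf / "config", "rb") as f:
        raw = f.read(2)
    if len(raw) < 2:
        raise OSError(f"short read from config space of {bdf}")
    # PCI configuration space is little-endian.
    return int.from_bytes(raw, "little")


def looks_unresponsive(bdf, root=SYSFS_PCI):
    """Heuristic: config reads returning all ones mean the device is not answering.

    Assumption: a GPU that failed its reset shows up this way. A healthy
    NVIDIA device would report vendor ID 0x10DE instead.
    """
    return vendor_id_from_config(bdf, root) == 0xFFFF
```

A monitoring script could call `looks_unresponsive("0000:01:00.0")` after each guest shutdown and take the node out of rotation before another VM is scheduled on it. This only detects the condition; it does not recover the GPU, and on a host that has already hit the soft lockup the read itself may never return.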

This issue extends beyond CloudRift. A user on the Proxmox forums reported a similar experience, in which the entire host crashed after a Windows guest was shut down. NVIDIA has reportedly acknowledged the situation, saying it has reproduced the issue and is working on a fix. A formal public statement from the company is still awaited, but early indications suggest the issue is primarily associated with its Blackwell architecture GPUs.

To speed up a resolution, CloudRift has announced a bug bounty of $1,000 for developers who can either fix or mitigate the issue. Given the importance of these GPUs for AI workloads, NVIDIA is under mounting pressure to resolve the problem quickly.
