Research Shows GPUHammer Can Decrease AI Model Accuracy on GDDR6 Memory GPUs From 80% to Only 0.1%

Recent research into GPU technology has revealed critical vulnerabilities, particularly involving DRAM banks. One of the notable findings is GPUHammer, a RowHammer-style attack capable of drastically reducing AI model accuracy to below 1% on high-performance GPUs equipped with GDDR6 VRAM.

Toronto Researchers Identify RowHammer-Style Threats to NVIDIA RTX A6000, Compromising AI Model Reliability

Researchers from the University of Toronto have shed light on how RowHammer attacks compromise the integrity of AI models by causing bit flips within GPU memory. This RowHammer vulnerability not only affects conventional memory cells but poses a significant risk to GPU memory systems, as evidenced by their experiments.

The team specifically targeted the GDDR6 VRAM of the NVIDIA RTX A6000, demonstrating that inducing bit flips in its DRAM banks considerably degraded the accuracy of AI models running on the GPU. Remarkably, this degradation occurred even with hardware defenses such as Target Row Refresh (TRR) active. For instance, flipping a single bit in an FP16 weight reduced DNN prediction accuracy from 80% down to a mere 0.1% on several key ImageNet models.
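To see why a single flipped bit can be so destructive, consider the FP16 format: 1 sign bit, 5 exponent bits, and 10 mantissa bits. Flipping the most significant exponent bit of an ordinary weight turns it into infinity, which then propagates through the network. A minimal illustration in Python (not the researchers' code; the helper below is purely for demonstration):

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit in the binary representation of an FP16 value."""
    bits = np.float16(value).view(np.uint16)      # reinterpret FP16 as raw bits
    flipped = np.uint16(bits ^ (1 << bit))        # XOR toggles the chosen bit
    return flipped.view(np.float16)               # reinterpret back as FP16

w = np.float16(1.0)        # a typical DNN weight
print(flip_bit(w, 14))     # exponent MSB flipped: 1.0 becomes inf
print(flip_bit(w, 0))      # mantissa LSB flipped: value barely changes
```

The asymmetry is the point: a flip in a low mantissa bit is harmless noise, while a flip in a high exponent bit corrupts every activation the weight touches.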

[Image: Bit flips observed on the RTX A6000 — credit: gpuhammer.com]

The process implemented by GPUHammer consists of three critical steps:

  • Reverse-Engineering DRAM Bank Mappings
  • Maximizing Hammering Efficiency
  • Synchronizing with DRAM Refresh Cycles

Detailed explanations of these methods are available on the researchers’ website, which illustrates how they induced single-bit flips across four DRAM banks, requiring approximately 12,000 row activations per flip. While the GDDR6 memory on the RTX A6000 was found vulnerable, other GPUs such as the RTX 3080 demonstrated resilience against such attacks.
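The scale involved can be illustrated with a toy model (pure Python, purely illustrative — the real attack issues carefully timed memory accesses on the GPU itself; only the ~12,000-activation figure comes from the researchers):

```python
FLIP_THRESHOLD = 12_000  # approximate activations per flip reported by the researchers

class ToyVictimRow:
    """Toy model: enough activations of adjacent aggressor rows flip a bit."""
    def __init__(self):
        self.neighbour_activations = 0
        self.bit_flipped = False

    def activate_neighbour(self):
        self.neighbour_activations += 1
        if self.neighbour_activations >= FLIP_THRESHOLD:
            self.bit_flipped = True

victim = ToyVictimRow()
for _ in range(FLIP_THRESHOLD):       # "hammer" the aggressor rows
    victim.activate_neighbour()
print(victim.bit_flipped)             # True once the threshold is reached
```

In practice the hard part is not the loop but the two surrounding steps: knowing which addresses share a bank (step one) and timing the activations between TRR-protected refresh windows (step three).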

Interestingly, no bit flips were detected on the NVIDIA RTX 5090 or on data center models such as the A100 and H100, which use High Bandwidth Memory (HBM). For users of the RTX A6000, there is no immediate cause for alarm: GPUHammer’s effects can be largely mitigated by enabling Error-Correcting Code (ECC) memory, which detects and corrects these single-bit flips.

However, users should be aware that enabling ECC carries performance trade-offs. Reports indicate up to a 10% slowdown on the RTX A6000 in machine learning inference workloads, along with a reduction in usable VRAM capacity of up to 6.25%. NVIDIA has proactively addressed the issue by publishing a security advisory recommending that system-level ECC be enabled on affected GPUs. It is worth noting that many contemporary GPUs, including the Hopper and Blackwell families, ship with ECC enabled by default.
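On an RTX A6000, the ECC mode can be inspected and changed with `nvidia-smi` (administrator privileges required; the new mode only takes effect after a GPU reset or reboot):

```shell
# Query the current ECC mode on all GPUs
nvidia-smi -q -d ECC

# Enable ECC (takes effect after the next reboot / GPU reset)
sudo nvidia-smi -e 1
```

The roughly 6.25% capacity loss comes from ECC check bits being stored in the GDDR6 itself; HBM-based cards keep ECC metadata out of band, which is one reason the data center parts are unaffected by this trade-off.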

For further details, please refer to these sources: GPUHammer and Tom’s Hardware.
