In-Depth Look at AMD RDNA 4 Architecture: New Compute Units, Enhanced Raytracing Cores, AI Features, and Path Tracing Capabilities

AMD has officially detailed its upcoming RDNA 4 GPU architecture, which was designed for the Radeon RX 9000 series.

Introducing AMD RDNA 4: A Gamer-Centric GPU Revolution

Following the success of RDNA 3 and its enhanced RDNA 3.5 variant, the RDNA 4 architecture has generated considerable excitement among enthusiasts. Although the lineup lacks ultra-enthusiast models, RDNA 4 introduces significant improvements aimed specifically at gaming performance.

AMD RDNA 4 Architecture Overview

This latest architecture features several key enhancements:

  • Intensive optimization for demanding gaming scenarios
  • Enhanced rasterization and compute efficiency
  • Significant advancements in ray tracing performance
  • Comprehensive machine learning capabilities
  • Improved bandwidth efficiency across all applications
  • Multimedia enhancements tailored for gamers and content creators
AMD RDNA 4 Architecture Improvements

In comparison to RDNA 2, RDNA 4 GPUs deliver nearly double the rasterization performance, up to 2.5 times better ray tracing capabilities, and a striking 3.5 times improvement in machine learning workloads on a per compute unit basis. Let’s delve into the architectural components that make up RDNA 4.

Core Innovations in RDNA 4

The centerpiece of the RDNA 4 GPU architecture is the new Compute Engine.

RDNA 4 Compute Engine

The revamped Compute Units (CUs) boast dual SIMD32 vector units and enhanced matrix operations, offering:

  • Increased rates for 2x-16b & 4x-8b/4b dense matrices
  • Structured sparsity at a 4:2 ratio for over 2x improvement
  • Introduction of new 8b floating-point data types
  • Matrix loading with transpose capabilities
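The structured-sparsity item above can be illustrated with a small sketch. It assumes that "4:2" means at most two non-zero values are kept in every group of four weights (a pattern often written as 2:4 elsewhere). The function name `prune_2_of_4` is hypothetical, and this is plain Python for illustration, not AMD's hardware or driver logic.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four.

    Illustrative only: assumes len(weights) is a multiple of 4.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
```

Because half of each group is known to be zero, matrix hardware can in principle skip those multiplications, which is the likely source of the throughput gain quoted above.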

RDNA 4 also includes substantial shading improvements, allowing RDNA 4 shaders to allocate registers dynamically. The CUs can request and release registers as needed, which helps hide memory latency and improves overall core efficiency.
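A toy occupancy model may help show why releasing unused registers matters. Every number below is hypothetical and chosen only for illustration; none is an RDNA 4 specification, and the helper `waves_in_flight` is not a real API.

```python
def waves_in_flight(register_file_size, registers_per_wave):
    """How many waves fit if each one holds `registers_per_wave` registers."""
    return register_file_size // registers_per_wave


# Hypothetical numbers, not RDNA 4 specifications.
REGISTER_FILE = 1536   # registers available to one SIMD (assumed)
PEAK_NEED = 256        # registers a shader needs in its heaviest phase (assumed)
TYPICAL_NEED = 96      # registers it needs the rest of the time (assumed)

# Static allocation reserves the peak amount for the whole lifetime of a wave.
static_waves = waves_in_flight(REGISTER_FILE, PEAK_NEED)

# Dynamic allocation lets waves hold only what they currently need. This is an
# upper bound that applies when no wave is in its heaviest phase.
dynamic_waves = waves_in_flight(REGISTER_FILE, TYPICAL_NEED)

print(static_waves, dynamic_waves)  # 6 16
```

More waves in flight give the scheduler more work to switch to while other waves wait on memory, which is the usual way a GPU hides latency.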

Dynamic Register Allocation

The scalar unit gains new Float32 operations, alongside improved scheduling that includes split barriers, accelerated spill/fill, and enhanced instruction prefetch.

RDNA 4 Scalar Unit Improvements

Significantly, the 3rd generation ray tracing units now offer doubled ray intersection rates, enhanced BVH compression, and optimized ray traversal and shading. Each ray accelerator has been upgraded with:

  • Increased box and triangle intersection units
  • Hardware instance transformations
  • Improved ray tracing stack management
  • Enhanced BVH8 and node compression
  • Oriented bounding boxes for increased efficiency

These upgrades considerably lower the memory consumed by the BVH. On average, RDNA 4 needs less than 60% of the BVH memory that RDNA 3 required, largely due to its 8-wide node structure.
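As a rough illustration of why a wider tree can be more compact, the sketch below compares an idealized 4-wide tree with an 8-wide one. The scene size is hypothetical, the 4-wide baseline is an assumption (the article does not state the node width used by RDNA 3), and real BVHs are neither perfectly balanced nor built with one triangle per leaf.

```python
def min_tree_depth(leaf_count, branching_factor):
    """Smallest depth d such that branching_factor ** d >= leaf_count."""
    depth = 0
    capacity = 1
    while capacity < leaf_count:
        capacity *= branching_factor
        depth += 1
    return depth


def interior_node_count(leaf_count, branching_factor):
    """Interior nodes in a full k-ary tree: ceil((leaves - 1) / (k - 1))."""
    return -(-(leaf_count - 1) // (branching_factor - 1))


TRIANGLES = 1_000_000  # hypothetical scene, one triangle per leaf

for width in (4, 8):
    print(width,
          min_tree_depth(TRIANGLES, width),
          interior_node_count(TRIANGLES, width))
# 4 10 333333
# 8 7 142857
```

An 8-wide node is larger than a 4-wide one, so the saving in bytes is smaller than the drop in node count suggests; this is where node compression comes in.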

Moreover, AMD has introduced a new method to minimize traversal costs by encoding a rotation for each box, allowing for tighter bounding of geometry. This approach reduces traversal steps and peaks, improving performance by roughly 10%. Consequently, RDNA 4's CUs provide twice the ray traversal performance of RDNA 3 at the same clock speed and bandwidth.
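The benefit of a rotated box is easiest to see in two dimensions. The sketch below compares the area of an axis-aligned box with that of an oriented box around the same thin, rotated rectangle. The dimensions and function names are made up for illustration; this is not AMD's encoding.

```python
import math


def aabb_area_of_rotated_rect(length, width, angle_deg):
    """Area of the axis-aligned box enclosing a rectangle rotated by angle_deg."""
    angle = math.radians(angle_deg)
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    box_width = length * c + width * s
    box_height = length * s + width * c
    return box_width * box_height


def obb_area(length, width):
    """An oriented box can match the rectangle exactly, whatever its rotation."""
    return length * width


# A thin 10 x 1 sliver of geometry lying at 45 degrees (hypothetical values).
aabb = aabb_area_of_rotated_rect(10.0, 1.0, 45.0)
obb = obb_area(10.0, 1.0)
print(f"AABB area {aabb:.2f}, OBB area {obb:.2f}, ratio {aabb / obb:.2f}")
```

Here the axis-aligned box covers about six times the area of the oriented one, so many more rays would hit the box and miss the geometry inside. Those wasted tests are the traversal steps a tighter box avoids.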

An upgraded Command Processor features enhanced packet accelerators, and the cache hierarchy has been substantially improved. The architecture now includes up to 64 MB of 3rd Gen Infinity Cache, 8 MB of L2 cache, and 2 MB of aggregate CU cache. RDNA 4 retains GDDR6 memory but raises speeds to up to 20 Gbps, with a maximum capacity of 16 GB across a 256-bit bus interface. Enhanced memory compression also reduces bandwidth demands.
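The peak memory bandwidth implied by these figures follows from simple arithmetic: the per-pin data rate times the bus width, divided by eight bits per byte. The helper name below is hypothetical.

```python
def peak_bandwidth_gb_s(data_rate_gbps_per_pin, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (8 bits per byte)."""
    return data_rate_gbps_per_pin * bus_width_bits / 8


print(peak_bandwidth_gb_s(20.0, 256))  # 640.0
```

This is a theoretical ceiling; the Infinity Cache and memory compression mentioned above aim to reduce how often the GPU has to go to GDDR6 at all.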

RDNA 4 Memory Architecture

In the realm of artificial intelligence, AMD utilizes its 3rd Generation Matrix Acceleration engine, which features improved tensor rates, new 8b floating-point data types, structured sparsity support, and machine learning-enhanced resolution upscaling.
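To give a sense of what an 8-bit floating-point value looks like, the sketch below decodes one common layout, E4M3 as defined in the OCP 8-bit floating point specification (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits). The article does not say which encodings RDNA 4 uses, so treating E4M3 as representative is an assumption, and `decode_e4m3` is a hypothetical helper.

```python
def decode_e4m3(byte):
    """Decode one OCP-style E4M3 byte into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # E4M3 has NaN but no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (mantissa / 8) * 2.0 ** (1 - 7)
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


print(decode_e4m3(0x38))  # 1.0
print(decode_e4m3(0x7E))  # 448.0, the largest finite E4M3 value
print(decode_e4m3(0xC0))  # -2.0
```

With only three mantissa bits, precision is coarse, but each value takes half the storage and bandwidth of a 16-bit type, which suits inference workloads that tolerate low precision.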


When examining the image generation capabilities (SDXL 1.5) in normalized conditions, RDNA 4 CUs demonstrate a remarkable 2x enhancement compared to RDNA 3.


The Media Engine transitions to a dual-width format with upgraded encode/decode engines, resulting in up to 25% better quality for AVC (H.264) encoding, further enhancements to H.264 and H.265 encoding, and doubled AV1 throughput. The engine is also optimized for low-latency streaming. Furthermore, the Radiance Display Engine now supports DisplayPort 2.1a and HDMI 2.1b outputs, along with a refreshed scaling and sharpening mechanism.

Exploring the RDNA 4 GPU Architecture: The Navi 48 Die

The RDNA 4 block diagram showcases the full Navi 48 GPU, which is built on TSMC's 4nm process node and houses approximately 53.9 billion transistors within a chip area of 356.5 mm². The GPU fully supports the PCIe Gen5 standard.
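The transistor count and die area quoted above imply an average density, worked out in the short sketch below. Both inputs come from the article; the result is only an average over the whole die.

```python
TRANSISTORS = 53.9e9   # from the article
DIE_AREA_MM2 = 356.5   # from the article

density_millions_per_mm2 = TRANSISTORS / DIE_AREA_MM2 / 1e6
print(f"{density_millions_per_mm2:.1f} million transistors per mm^2")  # 151.2
```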

Let's dissect the Navi 48 GPU (Radeon RX 9070 XT), which consists of four shader engines, each housing multiple "Dual Compute Units" instead of WGPs. Each Dual Compute Unit (DCU) contains two Compute Units, giving eight DCUs or 16 CUs per Shader Engine. That totals 32 DCUs or 64 CUs on the chip, for 4096 stream processors or shader units.

Navi 48 GPU Architecture

Each DCU is equipped with two ray accelerator engines, translating to 16 RAs per Shader Engine and 64 RAs total. Furthermore, every DCU incorporates four Matrix Acceleration Engines, amounting to 32 MAs per Shader Engine and 128 MAs in total. Shader Engines also contain four RB+ blocks, a rasterizer engine, and a primitive unit block. The chip design features four sections of 3rd Gen Infinity Caches and four 4×16-bit memory controllers positioned around the periphery of the GPU.
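The unit counts quoted for Navi 48 can be cross-checked with a few multiplications. The per-block figures below are taken from the text; the 64 lanes per CU follows from the two SIMD32 vector units per CU mentioned earlier and matches 4096 stream processors across 64 CUs. The variable names are illustrative.

```python
SHADER_ENGINES = 4
DCUS_PER_ENGINE = 8
CUS_PER_DCU = 2
LANES_PER_CU = 2 * 32          # two SIMD32 vector units per CU
RAS_PER_DCU = 2                # ray accelerators
MAS_PER_DCU = 4                # matrix acceleration engines
MEMORY_CONTROLLERS = 4
CHANNELS_PER_CONTROLLER = 4
BITS_PER_CHANNEL = 16

total_dcus = SHADER_ENGINES * DCUS_PER_ENGINE
total_cus = total_dcus * CUS_PER_DCU
stream_processors = total_cus * LANES_PER_CU
ray_accelerators = total_dcus * RAS_PER_DCU
matrix_engines = total_dcus * MAS_PER_DCU
bus_width_bits = MEMORY_CONTROLLERS * CHANNELS_PER_CONTROLLER * BITS_PER_CHANNEL

print(total_dcus, total_cus, stream_processors)            # 32 64 4096
print(ray_accelerators, matrix_engines, bus_width_bits)    # 64 128 256
```

The 256-bit result agrees with the bus width given in the memory section.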

At the center of the chip sit the L2 caches, which surround two Geometry processors, two Asynchronous Compute Engines (ACE), one Hardware Scheduler (HWS), and one Direct Memory Access (DMA) engine. Infinity Fabric connects the blocks across the architecture.

The Future of Path Tracing in Gaming with AMD

Ray tracing is now common in PC gaming, to the point that it is often viewed as the traditional approach. It enhances visual realism by simulating reflections, shadows, and refractions, but a newer, more sophisticated technique called Path Tracing has emerged and is gaining traction, especially in high-end gaming scenarios. Path Tracing follows many possible light paths through the scene, including multiple bounces, for even greater realism.

Path Tracing Graphics Advances

NVIDIA has successfully implemented Path Tracing in graphically intensive titles like Cyberpunk 2077 and Alan Wake II, showcasing stunning visuals. This was made feasible through advanced techniques such as AI-assisted upscaling and frame generation, along with the development of new ray reconstruction technology that supersedes traditional in-engine denoisers by relying on AI and machine learning.

AMD is aligning its RDNA 4 Path Tracing capabilities with a similar strategy, deploying its Neural Supersampling and Denoising technologies to achieve enhanced graphical fidelity.
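Why path tracing leans so heavily on denoising can be seen in a toy Monte Carlo estimate. The sketch below estimates the light arriving at a surface point under a uniformly bright sky, a case where the exact answer is pi, using randomly sampled directions. It is a generic textbook-style illustration with hypothetical names, not how AMD, NVIDIA, or any game engine implements path tracing.

```python
import math
import random


def estimate_sky_irradiance(sample_count, seed=0):
    """Monte Carlo estimate of irradiance under a uniform sky of radiance 1.

    The exact answer is pi. Each sample picks a random direction on the
    hemisphere (uniform in solid angle), for which cos(theta) is uniform
    on [0, 1] and the sampling density is 1 / (2 * pi).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sample_count):
        cos_theta = rng.random()
        total += cos_theta * 2.0 * math.pi  # radiance * cos(theta) / pdf
    return total / sample_count


for n in (4, 64, 1024, 100_000):
    value = estimate_sky_irradiance(n)
    print(f"{n:>7} samples: {value:.4f} (error {abs(value - math.pi):.4f})")
```

The error shrinks only with the square root of the sample count, so a real-time budget of a few samples per pixel leaves visible noise. Denoising and upscaling are meant to close that gap.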

Enhanced Media and Display Technologies

Addressing the Media and Display components, AMD has introduced substantial upgrades to boost game streaming and recording performance:

  • A 25% improvement in AVC low-latency encode quality
  • 11% enhancement in HEVC encoding quality
  • Optimized B Frames for AV1 encoding efficiency
  • Up to 30% encoding performance boost at 720p
  • Compatibility with FFMPEG, OBS, and Handbrake
  • VCN low-power video playback, delivering a 50% performance uplift for AV1 and VP9 formats
Media Engine Enhancements

Improvements in display technology focus on enhanced FreeSync power optimization, which significantly reduces idle power consumption in dual-display configurations. Additionally, hardware support for frame scheduling offloads tasks to the GPU, allowing CPUs to conserve power during video playback. Lastly, Radeon Image Sharpening 2 delivers high-quality imagery across all APIs with a single, straightforward toggle.

Display Engine Upgrades
