AMD Explores Stacking L2 Cache for Future Chips to Improve Latency Beyond Traditional Designs After Stacked L3

AMD is researching ways to integrate L2 cache in a stacked configuration in its future processors. The goal is to add capacity while keeping access latency the same or lowering it.

Advancements in Chip Design: AMD’s Stacked L2 Cache Exploration

AMD has published a research paper titled “Balanced Latency Stacked Cache”, associated with patent application number US20260003794A1. In it, AMD outlines a balanced latency stacked cache system that incorporates at least two cache dies stacked vertically.

A presentation slide titled '2nd Gen AMD 3D V-Cache Technology' illustrates features like 'Up to 8-core Zen 5 CCD', '64MB L3 Cache Die', 'Through Silicon Vias (TSVs) for silicon-to-silicon communication', and 'Direct copper-to-copper bond'.

AMD is already well known for stacked cache through its 3D V-Cache product line, which adds a layer of L3 cache above or below the core compute chiplets. The first generation of 3D V-Cache sat on top of the Zen compute chiplets, whereas the second generation reversed the arrangement and placed the cache beneath the compute chiplet. The approach is the same in both cases; only the position of the cache die relative to the compute die changes.

The 3D V-Cache, or X3D technology, is deployed across various AMD chips, from the consumer “Ryzen” family to the high-performance “EPYC” series designed for data centers. As AMD continues to develop its L3 3D V-Cache, it is now also investigating stacked L2 caches, as its latest patent suggests.

A diagram labeled 'FIG. 3' compares a multi-tiered core design with 'Core 310' and 'Base Die 304' (top) with a more complex structure featuring multiple 'L2 Die' and 'L3 Die' configurations on 'Base Die 406' (bottom). Image Source: AMD Patent

For its stacked L2 cache, AMD illustrates a base die carrying both compute and cache dies, with an additional compute and cache die layered above. The example shows a cache module composed of four 512 KB regions, for a total of 2 MB of L2 cache, managed by Cache Control Circuitry (CCC). The architecture is scalable, with designs allowing for up to 4 MB of L2 cache, as depicted in the accompanying block diagram.

A diagram titled 'Balanced Latency Stacked Cache' illustrating a cache die structure with labeled sections including '512KB Region', 'Tag Field', and 'Cache Control Circuitry', alongside a base die. Image Source: AMD Patent
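The capacity figures above are simple multiples of the region size, which the following minimal sketch spells out. The function names are hypothetical, and the assumption that a 4 MB variant would keep 512 KB regions is ours; the article does not say how the larger design is partitioned.

```python
import math

REGION_KB = 512  # size of one cache region, as quoted in the article


def l2_capacity_mb(num_regions: int, region_kb: int = REGION_KB) -> float:
    """Total capacity in MB of a module built from equal-sized regions."""
    return num_regions * region_kb / 1024


def regions_needed(target_mb: float, region_kb: int = REGION_KB) -> int:
    """Regions required to reach a target capacity.

    Assumption: the larger design reuses the same region size.
    """
    return math.ceil(target_mb * 1024 / region_kb)


print(l2_capacity_mb(4))   # four 512 KB regions -> 2.0 MB
print(regions_needed(4))   # 4 MB from 512 KB regions -> 8
```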

The stacking strategy mirrors the principles of 3D V-Cache, linking the L2 and L3 caches to the base die and compute complexes through vertically routed through-silicon vias (TSVs). The CCC governs the data flow throughout this system.

A notable point in AMD’s findings is the latency comparison between planar and stacked configurations. The document states that a planar 1 MB L2M cache has a latency of 14 cycles, while a stacked 1 MB version brings that down to 12 cycles. The stacked L2 configuration therefore not only supports more capacity but also matches or beats the latency of a traditional planar layout.
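To put the 14-cycle and 12-cycle figures in perspective, here is a small worked calculation. The cycle counts come from the article; the 5.0 GHz clock is a purely illustrative assumption, since no frequency is given, and the function name is hypothetical.

```python
def latency_summary(planar_cycles: int, stacked_cycles: int, clock_ghz: float):
    """Return (cycles saved, percent reduction, nanoseconds saved)."""
    saved = planar_cycles - stacked_cycles
    pct = 100 * saved / planar_cycles
    # At f GHz one cycle lasts 1/f nanoseconds.
    ns_saved = saved / clock_ghz
    return saved, pct, ns_saved


# 14 and 12 cycles are the quoted figures; 5.0 GHz is an assumed clock.
saved, pct, ns = latency_summary(14, 12, clock_ghz=5.0)
print(f"{saved} cycles saved ({pct:.1f}%), about {ns:.2f} ns at 5.0 GHz")
# prints: 2 cycles saved (14.3%), about 0.40 ns at 5.0 GHz
```

The absolute saving per access is small, but L2 accesses are frequent, so the reduction applies many times over.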

A diagram labeled 'FIG. 6' shows a base die '606' with stacked 'L2 Die' and 'L3 Die' components connected by markers '602', '604', and '608'. Image Source: AMD Patent

In aspects of the described techniques, the configuration of the stacked cache system reduces response latency when accessing the stacked cache, and also provides a power savings feature. The stacked cache system improves data transfer performance, and has a lower latency than a conventional planar cache built on a single die. Notably, the connection vias are routed into and out of the center of the stacked cache system. This avoids adding wire stages (also referred to herein as pipe stages), as in a conventional planar cache, to route data over one part of the cache to reach a portion of the cache that is further away from the data I/Os.

In the described techniques, the connection vias that are routed center of the stacked cache system create balanced (or identical) latencies between the two halves of the stacked cache system on the stacked die (e.g., of the first cache die and the at least second cache die). For example, a conventional planar 1 MB L2M cache has a 14 cycle latency, while a stacked 1 MB L2M cache implemented using the described techniques has only a 12 cycle latency. This provides for implementation of a larger stacked cache than a typical planar cache, yet achieves the same or better cycle latency.

Accordingly, the described aspects of balanced latency stacked cache provides lower latency for an access request, and data is returned from the data cache faster. There is also a power savings due to an access request being accomplished in fewer cycles, so an L2 cache for example, is not turned on for as long, as well as a power savings when transitioning sooner from an active state to an idle state of the cache. Additionally, wire lengths in the cache die are shorter, which effectively results in less capacitance and also conserves power. There is also less signal loading because the signals are only traveling half the distance for an access request, and the data return. Further, less heat is being generated as a result of the power savings, less capacitance, and signals traveling less distance.

via AMD Research Paper (Google Patents)
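The geometric idea in the excerpt, that feeding a cache from its center halves the worst-case wire run and balances the two halves, can be illustrated with a toy one-dimensional model. This is a sketch under our own assumptions, not AMD's design: the cache is treated as a row of equal-width regions, distance is measured in region widths, and the function name is hypothetical. It does not attempt to map distance to pipe stages or to reproduce the 14-cycle and 12-cycle figures.

```python
def far_edge_distances(num_regions: int, io_position: float) -> list[float]:
    """Distance from the I/O point to the farthest point of each region.

    Region i spans [i, i + 1] along a 1-D strip (toy model, our assumption).
    """
    return [
        max(abs(i - io_position), abs(i + 1 - io_position))
        for i in range(num_regions)
    ]


N = 4  # four regions, matching the four-region module in the article

edge_fed = far_edge_distances(N, io_position=0.0)       # I/O at one edge
center_fed = far_edge_distances(N, io_position=N / 2)   # I/O in the middle

print(edge_fed)    # [1.0, 2.0, 3.0, 4.0]
print(center_fed)  # [2.0, 1.0, 1.0, 2.0]

# Worst-case run is halved, and both halves see the same set of distances.
print(max(edge_fed), max(center_fed))                        # 4.0 2.0
print(sorted(center_fed[: N // 2]) == sorted(center_fed[N // 2:]))  # True
```

In this model the edge-fed layout has a far region that costs four region widths to reach, while the center-fed layout never exceeds two, which is the "half the distance" effect the excerpt describes.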

Beyond reducing latency, AMD emphasizes the energy efficiency gained through the stacked L2 cache design. It may take some time before stacked L2 caches appear in actual hardware, but there is strong optimism that the technique will feature in future generations of AMD processors and GPUs.