DeepSeek V4 Reduces KV Cache Usage by 90% for 1M Tokens, Though Aggressive Compression May Lead to ‘Needle in a Haystack’ Issues

DeepSeek, a leading Chinese artificial intelligence lab, has unveiled its latest V4 model, which substantially reduces the compute and memory needed for inference. According to the release notes, the new model uses only 27% of the single-token inference FLOPs and 10% of the key-value (KV) cache required by its predecessor, DeepSeek V3.2. The smaller cache lowers memory consumption and leaves developers more room for long-context workloads.

DeepSeek V4: Enhanced Performance and Cache Efficiency

DeepSeek reports these figures, 27% of the single-token inference FLOPs and 10% of the KV cache relative to V3.2, for a context window of one million tokens. The context window is the maximum amount of text, measured in tokens, that a large language model can take into account at once.

This memory saving matters most during the Decode phase. Inference is typically divided into two stages: Prefill and Decode. In Prefill, the model processes the prompt and stores the resulting keys and values in the KV cache. In Decode, it generates output one token at a time while keeping the context established during Prefill in memory. As a result, the Decode phase is particularly demanding on memory, especially for the KV cache.
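To see why the cache dominates memory at long context, here is a back-of-the-envelope sketch of KV cache size for a conventional attention layout. The layer count, head count, and head dimension are hypothetical values chosen for illustration; they are not DeepSeek's published architecture.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 counts one key vector and one value vector
    per token, per layer, per KV head. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, NOT DeepSeek's actual architecture.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (8_192, 131_072, 1_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:7.2f} GiB of KV cache")
```

Under these assumed sizes the cache grows linearly, from 1 GiB at 8,192 tokens to roughly 122 GiB at one million tokens for a single sequence, which is why a 90% reduction matters at that scale.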

An NVIDIA diagram illustrating KV cache operation, with elements labeled 'Cache Evictions,' 'Cache Hit,' and 'Cache Miss.' Image: Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache.

Building on Previous Innovations: Enhanced Features of DeepSeek Models

As context length increases, so does the demand on the KV cache. At the one-million-token mark, a model with a smaller cache can serve more requests while needing less memory overall. However, DeepSeek’s 27% figure for single-token inference FLOPs still depends on having enough GPU memory available to carry out the computation.
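The effect on serving capacity can be sketched with simple arithmetic. Only the 10% ratio comes from DeepSeek's release notes; the memory budget and the baseline cache size per request below are hypothetical numbers picked to keep the example readable.

```python
def max_concurrent_requests(memory_budget_gib, cache_gib_per_request):
    """How many full-length requests fit in the memory set aside for KV cache."""
    return int(memory_budget_gib // cache_gib_per_request)


BUDGET_GIB = 400          # hypothetical memory left over for KV cache
BASELINE_CACHE_GIB = 100  # hypothetical cache per 1M-token request on the older model
CACHE_PERCENT = 10        # V4 cache relative to V3.2, per the release notes

reduced_cache_gib = BASELINE_CACHE_GIB * CACHE_PERCENT / 100

baseline = max_concurrent_requests(BUDGET_GIB, BASELINE_CACHE_GIB)
reduced = max_concurrent_requests(BUDGET_GIB, reduced_cache_gib)

print(f"baseline cache: {baseline} concurrent requests")
print(f"10% cache:      {reduced} concurrent requests")
```

With these assumed figures, the same memory budget holds ten times as many one-million-token requests. Real deployments also spend memory on weights and activations, so the gain in practice depends on how much of the budget the cache actually occupies.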

Moreover, such a steep reduction in cache memory involves trade-offs. Aggressive compression can lead to “needle in a haystack” failures, in which the model overlooks a specific detail buried in a long context and produces less accurate output. The challenge is balancing memory efficiency against output fidelity.
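A “needle in a haystack” check is usually run by planting one fact at different depths of a long filler text and asking the model to retrieve it. The sketch below shows the general shape of such a test. All function names are hypothetical, and `forgetful_model` is a toy stand-in that only sees the end of the prompt; it does not represent how V4 or any real model behaves.

```python
def build_haystack_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def needle_recalled(answer, expected):
    """True if the expected fact appears in the model's answer."""
    return expected.lower() in answer.lower()


def run_needle_test(model, filler_sentences, needle, question, expected, depths):
    """Query `model` once per depth and record whether the needle was recalled."""
    results = {}
    for depth in depths:
        prompt = build_haystack_prompt(filler_sentences, needle, depth)
        results[depth] = needle_recalled(model(prompt + "\n\n" + question), expected)
    return results


def forgetful_model(prompt):
    """Toy stand-in: it can only 'see' the last 400 characters of the prompt."""
    visible = prompt[-400:]
    return "The code is 7421." if "7421" in visible else "I don't know."


filler = ["The sky was clear that day."] * 200
results = run_needle_test(
    forgetful_model,
    filler,
    needle="The vault code is 7421.",
    question="What is the vault code?",
    expected="7421",
    depths=(0.0, 0.5, 1.0),
)
print(results)
```

The toy model recalls the needle only when it sits at the end of the prompt. To evaluate a real model, one would swap in a function that calls that model and sweep both depth and context length.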

The latest enhancements in DeepSeek’s V4 model build on the Multi-Head Latent Attention architecture introduced in earlier versions. This design addresses memory limits by compressing each token’s keys and values into a single shared latent representation, which is stored in the cache and expanded back into keys and values during computation.
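The compress-then-expand idea can be illustrated with a heavily simplified, single-head sketch in plain Python. The dimensions and the random weight matrices are hypothetical, and the sketch leaves out details of the real architecture such as multiple heads and positional encoding; it is not DeepSeek's implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


HIDDEN_DIM, LATENT_DIM, HEAD_DIM = 64, 8, 32  # hypothetical sizes

rng = random.Random(0)
W_down = random_matrix(LATENT_DIM, HIDDEN_DIM, rng)  # hidden state -> shared latent
W_up_k = random_matrix(HEAD_DIM, LATENT_DIM, rng)    # latent -> key
W_up_v = random_matrix(HEAD_DIM, LATENT_DIM, rng)    # latent -> value

latent_cache = []  # one small latent vector per token, instead of a full key and value


def append_token(hidden_state):
    """Compress the token's hidden state and cache only the latent vector."""
    latent_cache.append(matvec(W_down, hidden_state))


def expand_cache():
    """Rebuild keys and values from the cached latents when attention needs them."""
    keys = [matvec(W_up_k, latent) for latent in latent_cache]
    values = [matvec(W_up_v, latent) for latent in latent_cache]
    return keys, values


for _ in range(5):
    append_token([rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)])

keys, values = expand_cache()
cached_floats = len(latent_cache) * LATENT_DIM
uncompressed_floats = len(keys) * HEAD_DIM * 2  # a key plus a value per token

print(f"cached: {cached_floats} floats, uncompressed: {uncompressed_floats} floats")
```

With these assumed sizes, five tokens need 40 cached numbers instead of 320. The cost is the extra projection work to rebuild keys and values, and any information the low-dimensional latent fails to preserve.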
