Apple has taken a distinctive path in generative AI, running its cloud AI workloads on its own silicon rather than on NVIDIA GPUs. That strategy is expected to continue with the forthcoming M4 Ultra chip, intended to boost processing for Large Language Models (LLMs). Recently, however, Apple has shown a willingness to work with NVIDIA to accelerate LLM text generation, hinting at possible synergy between the two tech giants.
Introducing ‘ReDrafter’: A Game-Changer in Text Generation
Apple recently unveiled a technique called ‘ReDrafter’—short for Recurrent Drafter—which sets a new benchmark in text generation. The method combines two approaches, beam search and tree attention, both aimed at speeding up token generation. Following its internal research, Apple worked with NVIDIA to integrate ReDrafter into TensorRT-LLM, NVIDIA's framework for accelerating LLM inference on its hardware.
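At its core, ReDrafter is a speculative decoding scheme: a small draft model proposes several tokens ahead, and the large target model verifies them in one pass, keeping the longest agreeing prefix. The toy sketch below illustrates only this general draft-and-verify idea—the models and token rules are made-up stand-ins, not Apple's actual recurrent drafter, beam search, or tree attention.

```python
def draft_tokens(prefix, k):
    """Toy draft model: guesses the next k tokens, drifting off after two."""
    last = prefix[-1]
    return [last + i + 1 + (1 if i >= 2 else 0) for i in range(k)]

def target_next_token(prefix):
    """Toy target model: the 'true' greedy rule is simply previous + 1."""
    return prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Propose k draft tokens, then accept the longest verified run."""
    accepted = []
    for tok in draft_tokens(prefix, k):
        if target_next_token(prefix + accepted) == tok:
            accepted.append(tok)  # draft agreed with target: keep it free of charge
        else:
            # Mismatch: take the target's token instead and stop this round.
            accepted.append(target_next_token(prefix + accepted))
            break
    return prefix + accepted

seq = [0]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # each step emits 3 tokens per target-model pass instead of 1
```

Because several drafted tokens can be verified in a single forward pass of the large model, the expensive model runs far fewer times per generated token—which is where the reported speed-ups come from.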
Importantly, ReDrafter targets not only raw speed but also lower latency and reduced energy consumption—an increasingly critical factor in large-scale inference.
“This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.
Although TensorRT-LLM supports numerous open-source LLMs and the Medusa speculative decoding method, ReDrafter’s beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM’s capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
In benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding. These benchmark results indicate this tech could significantly reduce latency users may experience, while also using fewer GPUs and consuming less power.”
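The tree attention mentioned in the quote can be pictured as a mask over a tree of candidate continuations, where each candidate token attends only to itself and its ancestors, letting many branches be verified in one batch. The helper below is a generic illustration of that idea—it is not the actual operator NVIDIA added to TensorRT-LLM.

```python
def tree_attention_mask(parents):
    """Build an attention mask for a tree of candidate tokens.

    parents[i] is the index of node i's parent (-1 for the root).
    Node i may attend to itself and all of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:        # walk up to the root, enabling each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

# A root with two children; node 3 extends node 1's branch.
mask = tree_attention_mask([-1, 0, 0, 1])
for row in mask:
    print("".join("1" if x else "0" for x in row))
# 1000
# 1100
# 1010
# 1101
```

Node 3 can see nodes 0 and 1 (its branch) but not node 2 (a sibling branch), so divergent drafts never contaminate each other during verification.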
This collaboration hints at a tentative alliance between Apple and NVIDIA, of the kind tech companies often forge out of mutual interest. However, lingering historical tensions between the two make a sustained formal partnership doubtful: temporary collaborations like this may resurface, but a long-term alliance seems unlikely.
For further details, explore the original news release by Apple: Apple’s Official Blog.
Additionally, insights can be found in this comprehensive article: Wccftech’s Coverage.