New LLM optimization technique reduces memory costs by up to 75%

Researchers at Tokyo-based startup Sakana AI have developed a new technique that allows language models to use memory more efficiently, helping companies reduce the cost of building applications on top of large language models (LLMs) and other Transformer-based models.

The technique, called “universal transformer memory,” uses special neural networks to optimize LLMs to retain important pieces of information and remove redundant details from context.

Transformer memory optimization

The responses of Transformer models, the backbone of LLMs, depend on the contents of their “context window”, that is, what they receive as input from users.

The context window can be thought of as the working memory of the model. Changing the contents of the context window can have a huge impact on model performance, which has given rise to an entire field of “prompt engineering.”

Current models support very long context windows with hundreds of thousands, or even millions, of tokens (an LLM’s numerical representations of the words, word parts, phrases, concepts, and numbers users enter in their prompts).

This allows users to incorporate more information into their prompts. However, longer prompts may result in higher computational costs and slower performance. Optimizing prompts to remove unnecessary tokens while retaining important information can reduce costs and increase speed.
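
As a rough, illustrative calculation (not a figure from Sakana’s paper), the sketch below estimates how the KV cache, the per-token state a Transformer keeps for its context window, grows with prompt length; the layer, head, and dimension counts are assumptions loosely modeled on a Llama 3 8B-style configuration.

```python
# Rough, illustrative arithmetic (assumed Llama-3-8B-like configuration, not
# figures from the paper): the KV cache stores one key and one value vector
# per token, per layer, per KV head, so its size grows linearly with the
# number of tokens kept in the context window.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:  # fp16/bf16
    return num_tokens * num_layers * num_kv_heads * head_dim * 2 * bytes_per_value

full = kv_cache_bytes(100_000)   # a 100,000-token prompt
pruned = kv_cache_bytes(25_000)  # the same prompt with 75% of tokens dropped
print(f"full cache:   {full / 1e9:.1f} GB")   # ~13.1 GB
print(f"pruned cache: {pruned / 1e9:.1f} GB") # ~3.3 GB
```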

Current prompt optimization techniques are resource-intensive or require users to manually test different configurations to reduce the size of their prompts.

Neural Attention Memory Models

Universal transformer memory optimizes prompts using Neural Attention Memory Models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each token stored in the LLM’s memory.
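
A minimal sketch of that decision rule, with hypothetical feature and network sizes rather than the architecture from the paper: a small scorer maps each cached token’s features to a scalar, and the sign of the score decides whether the token is remembered or forgotten.

```python
import numpy as np

# Minimal sketch of a NAMM-style keep/forget decision (hypothetical feature
# size and network shape, not Sakana AI's implementation).
rng = np.random.default_rng(0)

FEATURE_DIM = 32   # assumed size of the per-token feature vector
HIDDEN_DIM = 16

# Tiny two-layer scorer; in universal transformer memory these weights are
# found by evolutionary search rather than gradient descent.
W1 = rng.normal(scale=0.1, size=(FEATURE_DIM, HIDDEN_DIM))
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, 1))

def keep_token(token_features: np.ndarray) -> bool:
    """Return True if the token should stay in the model's memory."""
    hidden = np.tanh(token_features @ W1)
    score = (hidden @ W2).item()
    return score > 0.0  # positive score -> remember, negative -> forget

# Example: decide over a batch of 10 cached tokens.
token_features = rng.normal(size=(10, FEATURE_DIM))
keep_mask = [keep_token(f) for f in token_features]
print("tokens kept:", sum(keep_mask), "of", len(keep_mask))
```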

“This new capability allows Transformers to eliminate unnecessary or redundant details and focus on the most critical information, which we find crucial for tasks requiring reasoning in a long context,” the researchers write.

Universal Transformer Memory (source: Sakana AI)

NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, making them flexible and easy to deploy. However, they need access to the model’s internal activations, which means they can only be applied to open source models.
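
For illustration only, the snippet below shows one way to read those internal attention activations from an open-weights model with the Hugging Face transformers library; it is not Sakana’s integration code, and the model name is just a placeholder for any open model you have access to.

```python
# Illustrative only: reading a Transformer's internal attention activations
# from an open-weights model via Hugging Face transformers. This is why NAMMs
# require open models; it is not Sakana AI's integration code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # placeholder; any open model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("Universal transformer memory prunes the KV cache.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# One attention tensor per layer, shaped (batch, heads, query_len, key_len);
# a NAMM would consume these values to score each cached token.
print(len(outputs.attentions), outputs.attentions[0].shape)
```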

Like other techniques developed by Sakana AI, NAMMs are trained via evolutionary algorithms rather than gradient-based optimization methods. By iteratively mutating and selecting the best-performing models through trial and error, evolutionary algorithms optimize NAMMs for efficiency and performance. This is particularly important because NAMMs are trying to achieve a non-differentiable goal: keeping or discarding tokens.
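
The sketch below shows the general shape of such a search, a simple mutation-and-selection loop rather than the specific algorithm used in the paper; the parameter count and the stand-in fitness function are assumptions for the demo. The key point is that fitness only needs to be evaluated, never differentiated.

```python
import numpy as np

# Highly simplified mutation-and-selection loop (not the exact algorithm from
# the paper): candidates are scored by running them, so the objective never
# needs to be differentiable.
rng = np.random.default_rng(0)

PARAM_DIM = 512      # assumed number of NAMM parameters
POPULATION = 16
GENERATIONS = 50

def evaluate_fitness(params: np.ndarray) -> float:
    """Placeholder for running the frozen LLM with a NAMM parameterized by
    `params` and measuring task performance (and memory saved)."""
    target = np.linspace(-1.0, 1.0, PARAM_DIM)   # stand-in objective
    return -float(np.mean((params - target) ** 2))

best = rng.normal(size=PARAM_DIM)
best_fitness = evaluate_fitness(best)
for generation in range(GENERATIONS):
    # Mutate the current best parameters into a population of candidates.
    candidates = [best + 0.1 * rng.normal(size=PARAM_DIM) for _ in range(POPULATION)]
    for candidate in candidates:
        fitness = evaluate_fitness(candidate)
        if fitness > best_fitness:               # selection: keep the best performer
            best, best_fitness = candidate, fitness

print("final fitness:", round(best_fitness, 4))
```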

NAMMs work on top of the attention layers of LLMs, one of the key components of the Transformer architecture that determines the relationships and importance of each token in the model’s context window. Based on the attention values, NAMMs determine which tokens should be kept and which can be dropped from the LLM’s context window. This attention-based mechanism makes it possible to use a trained NAMM on different models without further modification. For example, a NAMM trained on text data can be applied to vision or multimodal models without additional training.
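
A simplified sketch of that mechanism, assuming a plain average of received attention as the per-token feature and a hypothetical median threshold in place of the learned scorer (the real NAMM derives richer features from the attention values):

```python
import numpy as np

# Simplified attention-based pruning: score each cached token by how much
# attention it received, then drop the low-scoring half of the KV cache.
rng = np.random.default_rng(0)

num_heads, num_queries, num_cached, head_dim = 8, 64, 1_000, 128
attention = rng.random((num_heads, num_queries, num_cached))  # stand-in attention weights
keys = rng.normal(size=(num_cached, head_dim))                # stand-in KV cache
values = rng.normal(size=(num_cached, head_dim))

# Per-token feature: attention received, averaged over heads and recent queries.
received_attention = attention.mean(axis=(0, 1))              # shape: (num_cached,)

# A learned NAMM would map this feature to a keep/forget score; a median
# threshold stands in for that network here.
keep_mask = received_attention > np.median(received_attention)

pruned_keys, pruned_values = keys[keep_mask], values[keep_mask]
print(f"kept {keep_mask.sum()} of {num_cached} cached tokens")
```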

Neural Attention Memory Models (NAMMs) examine attention layers to determine which tokens should be kept or removed from the context window (source: Sakana AI)

Universal memory in action

To test the concept of universal transformer memory in action, the researchers trained a NAMM on top of an open-source Meta Llama 3 8B model. Their experiments show that with NAMMs, Transformer-based models perform better on natural language and coding problems over very long sequences. Meanwhile, by removing unnecessary tokens, NAMMs allowed the LLM to save up to 75% of its cache memory while performing those tasks.

“In our tests, NAMMs provide clear performance improvements to the Llama 3-8B transformer,” the researchers write. “Additionally, our memory systems provide notable side benefits, reducing the context size of each layer, without ever being explicitly optimized for memory efficiency.”

NAMMs compete with leading prompt optimization techniques while improving model performance (source: Sakana AI)

They also tested the NAMM on the 70B version of Llama, as well as on Transformer models designed for other modalities and tasks, such as Llava (computer vision) and Decision Transformer (reinforcement learning).

“Even in these out-of-distribution settings, NAMMs retain their benefits by discarding tokens such as redundant video frames and suboptimal actions, allowing their new base models to focus on the most relevant information to improve performance,” the researchers write.

Task-dependent behavior

Another interesting finding is that NAMMs automatically adjust their behavior depending on the task.

For example, in coding tasks, the model discards contiguous chunks of tokens corresponding to comments and whitespace that do not affect the execution of the code.

In contrast, in natural language tasks, the model eliminates tokens that represent grammatical redundancies and do not affect the meaning of the sequence.

The researchers have released the code for creating your own NAMMs. Techniques such as universal transformer memory can be very useful for enterprise applications that process millions of tokens and can benefit from speed increases and cost reductions. The reusability of a trained NAMM also makes it a versatile tool across different applications within a business.

Looking ahead, the researchers suggest more advanced techniques, such as using NAMMs during the training of LLMs to further expand their memory capabilities.

“This work only begins to harness the potential of our new class of memory models, which we believe could provide many new opportunities for advancing future generations of Transformers,” the researchers write.