Unlocking AI Scale With SSD Offload Techniques

Use 57% less memory and increase query speed by 50%? That can’t be right, can it?

AI-generated image depicting AI data offloading from GPU memory to SSD.

AI inference with retrieval-augmented generation (RAG) is having a Woodstock moment.

Organizers expected about 50,000 attendees at the legendary 1969 music festival. But the lineup was so killer—Jimi Hendrix, Janis Joplin, The Who, and dozens of others—and the counterculture vibe so perfectly timed, that the event drew nearly a half-million people to a dairy farm in Bethel, NY. Consequently, the experience of many was driven more by food shortages, sanitation problems, and mud than by the music.



In a similar way, inference with RAG is such a powerful and appealing approach to solving business problems that enterprises are rushing to adopt it in droves. The problem, of course, is that without sufficient infrastructure in place, the technology’s potential is dampened by a poor experience, or by lack of access altogether.

A new approach is needed, one that enables unprecedented levels of scalability and cost efficiency. We’re excited to share the results of groundbreaking work by Solidigm and Metrum AI. The strategy we outline here offloads significant amounts of data, including AI model weights and RAG data, from costly memory to high-performance SSDs, unlocking the value of AI like never before.

“We have developed a cutting-edge RAG solution for video analysis, leveraging state-of-the-art vision-language models and large language models to generate rich, contextual summaries,” says Steen Graham, CEO of Metrum AI. “By deploying on Solidigm D7-PS1010 SSDs and integrating DiskANN for high-speed, memory-efficient vector search, we optimized memory usage without compromising performance.”

Read on for details about our approach and key findings. You can also download the full white paper, High-Performance TCO-Optimized RAG With SSD Offloading, and for the “prove it” crowd, we’ve made the entire thing available in a GitHub repo for you to try it yourself!

What is RAG, and why is it catching fire?

A simple example: Imagine you ask an AI chatbot for guidance on what documents are needed to travel to a foreign country. If sufficient correct information is present in the model’s training data set, it will give you a helpful answer.

If not, one of two things may happen. It could tell you it doesn’t know, or worse, it could confidently give you a wrong answer. This is called a hallucination, and it’s more common than you might think.

Clearly, the value of AI is tied to the quantity and quality of data available to the model.

Retrieval-augmented generation, as the name implies, retrieves additional relevant data to augment a model’s knowledge before generating a response. This is accomplished by connecting the model to sources of data that were not included in the original training set. It could be an internal corporate database, a news feed, even Wikipedia; almost any source. So in our example, the user’s travel query would be run against one or more of these sources, and the relevant information retrieved and passed to the AI model along with the prompt, increasing the likelihood of a good response.
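To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The embed function, vector_store, and llm client are hypothetical stand-ins, not the specific components used in this work.

```python
# Minimal RAG sketch: retrieve relevant context, then generate an answer.
# `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whatever
# embedding model, vector database, and language model you actually deploy.

def answer_with_rag(question: str, vector_store, llm, embed, top_k: int = 4) -> str:
    # 1. Embed the user's question into the same vector space as the documents.
    query_vector = embed(question)

    # 2. Retrieve the most similar chunks from the external knowledge source.
    hits = vector_store.search(query_vector, k=top_k)
    context = "\n\n".join(hit.text for hit in hits)

    # 3. Augment the prompt with the retrieved context before generation.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.generate(prompt)
```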

The benefits of RAG are twofold

  1. Enterprises don’t have to continually retrain models to include more data. 
  2. It enables models to consult information that is more timely, authoritative, and specific than whatever was available in the public training set.

Overview of the RAG data set vs. the training data set.


An interesting debate has been heating up around whether RAG is already dead in light of new models that offer huge context windows. Meta’s Llama 4 Scout, for example, accommodates 10M tokens. The argument goes that if you can feed that much data into the prompt, you don’t need to connect to external data sources; just include all relevant info in the prompt itself.

It’s a reasonable argument on the surface, but it may be premature. A March 2025 research paper tested the recall (accuracy) of some of these newer models with big context windows and found that even when a model ostensibly supports a context window in the millions of tokens, recall suffered once more than a small fraction of it was used, about 2K tokens in most cases.

The problem?

You can see why companies are embracing RAG-enabled inference. The problem is the same one the Woodstock organizers faced more than 50 years ago: more users are demanding more of it, all at once.

Specifically, enterprises want:

  1. Bigger RAG data sets to increase the quantity and quality of data available to AI models
  2. More complex models to process the data and generate high-quality insights

Neither of these is a bad goal. But they both involve a heck of a lot of data, which must be stored somewhere. In today’s world, where model weights and RAG data tend to be stored in memory, that prospect gets extremely expensive very quickly.

Introducing the SSD offload approach

Working with Metrum AI, Solidigm has pioneered a new way forward. Our approach relies on open-source software components, carefully selected and fine-tuned to work together beautifully, to move a significant amount of data from memory to SSDs during AI inference.

There are two key components:

  1. RAG data offload: Using DiskANN, a suite of algorithms for large-scale vector data searches, we can relocate part of the RAG data set to SSDs. The primary benefit here is the ability to scale to much larger data sets in a much more cost-effective way.
  2. Model weight offload: Using Ray Serve with the DeepSpeed software suite, we can move a portion of the AI model itself to SSDs. The main benefit of doing so is enabling more complex models, or multiple models, in a fixed GPU memory budget. For example, we demonstrate the ability to run a 70-billion-parameter model, which typically requires about 160GB of memory, at a reduced peak usage of just 7GB to 8GB.
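To illustrate what the model-weight offload looks like in practice, below is a minimal sketch of a DeepSpeed ZeRO stage 3 configuration that directs parameters to NVMe instead of keeping them resident in memory. The mount path and buffer values are placeholder assumptions; the exact Ray Serve and DeepSpeed setup used in this work is documented in the white paper and the GitHub repo.

```python
# Sketch of a DeepSpeed ZeRO-3 configuration that offloads model parameters
# to an NVMe SSD. The nvme_path and buffer values are placeholders; tune them
# for your drive and model.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required by DeepSpeed, even for inference-style use
    "zero_optimization": {
        "stage": 3,                        # ZeRO-3 partitions parameters so they can be offloaded
        "offload_param": {
            "device": "nvme",              # page weights from SSD instead of holding them in memory
            "nvme_path": "/mnt/nvme0",     # placeholder mount point for the SSD
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 100_000_000,
        },
    },
    "aio": {                               # asynchronous I/O settings for NVMe reads
        "block_size": 1_048_576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

# A model wrapped with deepspeed.initialize(model=model, config=ds_config)
# then pages its parameters in from the SSD on demand during inference.
```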

Key findings

1. Reduced DRAM usage

Figure 1. DRAM usage with and without SSD offloading (lower is better)


The main value in offloading AI data from memory to SSDs is, naturally, that you need less memory. We used VectorDBBench, an open-source benchmarking tool for vector databases, to measure the effect across three data sets of increasing size, from 1 million vectors to 100 million.

The magnitude of the benefit scaled with database size. In other words, the more data you’re dealing with, the bigger the memory savings. On the largest data set, we observed a 191GB drop in DRAM usage, a 57% decrease. At today’s pricing, that’s a significant cost reduction.

2. Increased query speed

Figure 2. Queries per second with and without SSD offloading (higher is better)


By moving data from memory to SSDs, we observed an increase in performance as measured in queries per second (QPS): up to 70% higher on the middle data set, and 50% on the largest one. In other words, not only can you do inference with less memory, but you can also do it faster.

This may strike you as counterintuitive; when do you ever see a performance increase by reading from storage instead of memory? But we triple-checked the numbers. When configured with default parameters, DiskANN produces higher QPS than HNSW, the conventional in-memory approach. Indexing algorithms that do substantial pre-processing, such as Vamana (used by DiskANN), can dramatically speed up similarity searches by packing the vectors efficiently onto the SSD (more on indexing in a bit).

It's worth mentioning that in Solidigm testing, HNSW performance could be improved by modifying certain parameters, but at the cost of even higher memory use.
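For a sense of what the offloaded index involves, here is a rough sketch of building and querying an SSD-resident Vamana index with the diskannpy Python bindings. The parameter names follow diskannpy’s documented build and search calls, but treat them, along with the paths and tuning values, as assumptions to check against the version you install; this is not the exact configuration behind the numbers above.

```python
# Rough sketch: build a DiskANN (Vamana) index on an SSD, then search it with
# a small DRAM cache. Parameter names are based on the diskannpy docs and
# should be verified against your installed version.
import numpy as np
import diskannpy

embeddings = np.random.rand(100_000, 768).astype(np.float32)  # stand-in vectors

# Build the graph and compressed vectors on disk (the expensive, one-time step).
diskannpy.build_disk_index(
    data=embeddings,
    distance_metric="l2",
    index_directory="/mnt/nvme0/diskann_index",  # placeholder SSD path
    complexity=64,               # build-time candidate list size (quality vs. build time)
    graph_degree=32,             # maximum edges per node in the Vamana graph
    search_memory_maximum=4.0,   # GB of DRAM the searcher may use
    build_memory_maximum=16.0,   # GB of DRAM allowed during the build
    num_threads=8,
)

# Load a searcher that caches a subset of nodes in DRAM and streams the rest
# from the SSD at query time.
index = diskannpy.StaticDiskIndex(
    index_directory="/mnt/nvme0/diskann_index",
    num_threads=8,
    num_nodes_to_cache=100_000,
)

query = np.random.rand(768).astype(np.float32)
ids, distances = index.search(query, k_neighbors=10, complexity=64, beam_width=4)
```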

3. The tradeoff: Increased build time

Figure 3. Indexing time with and without SSD offloading (lower is better)


It’s been said there’s no such thing as a free lunch, and that’s true here too. The up-front time it takes to build the RAG index is 30% to 60% higher using the offload approach.

The payoff for doing more work up front, of course, is the better performance on an ongoing basis once the stack is deployed.

For certain use cases, this could be a deal breaker. For many others, though, the benefits, in terms of memory reduction and QPS improvement, will far outweigh the increased index build time. After all, indexing is an infrequent activity relative to how often you are actually using the model to generate valuable insights.

4. High recall

Figure 4. Recall with and without SSD offloading (higher is better)


Finally, a note on recall, or how accurate the model outputs are. We observed no significant difference between the conventional and SSD offload approaches, with both achieving recall rates near 100%. In other words, offloading data did not hurt output quality.

Conclusion

That’s what we measured. We believe there’s significant value here for businesses that want to pull lots of data into their inference pipeline without breaking the bank. Offloading RAG data means scaling to bigger data sets at a lower cost; offloading model weights enables companies to more easily deploy solutions on legacy hardware or at the edge, where GPU memory constraints are more severe.

But you don’t have to take our word for it.  Check out the GitHub repo for everything you need to reproduce these results yourself. You can also dig into the data and methodology in greater detail in the white paper, High-Performance TCO-Optimized RAG With SSD Offloading.


About the Author

Ace Stryker is the director of market development at Solidigm, where he focuses on emerging applications for the company’s portfolio of data center storage solutions.

Notes

A version of this article was originally published by Data Center Frontier.  

See High-Performance TCO-Optimized RAG With SSD Offloading White Paper for references to data and graphs reproduced in the article.

Nothing herein is intended to create any express or implied warranty, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, or any warranty arising from course of performance, course of dealing, or usage in trade.

The products described in this document may contain design defects or errors known as “errata,” which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Solidigm does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Contact your Solidigm representative or your distributor to obtain the latest specifications before placing your product order. 

SOLIDIGM and the Solidigm “S” logo are trademarks of SK hynix NAND Product Solutions Corp. (d/b/a Solidigm), registered in the United States, People’s Republic of China, Japan, Singapore, the European Union, the United Kingdom, Mexico, and other countries.