Organizers expected about 50,000 attendees at the legendary 1969 music festival. But the lineup was so killer—Jimi Hendrix, Janis Joplin, The Who, and dozens of others—and the counterculture vibe so perfectly timed, that the event drew nearly a half-million people to a dairy farm in Bethel, NY. Consequently, the experience of many was driven more by food shortages, sanitation problems, and mud than by the music.
In a similar way, inference with RAG is such a powerful and appealing approach to solving business problems that enterprises are rushing to adopt it in droves. The problem, of course, is that without sufficient infrastructure in place, the technology’s potential is hampered by a poor experience, or by lack of access altogether.
A new approach is needed, one that enables unprecedented levels of scalability and cost efficiency. We’re excited to share the results of groundbreaking work by Solidigm and Metrum AI. The strategy we outline here offloads significant amounts of data, including AI model weights and RAG data, from costly memory to high-performance SSDs, unlocking the value of AI like never before.
“We have developed a cutting-edge RAG solution for video analysis, leveraging state-of-the-art vision-language models and large language models to generate rich, contextual summaries,” says Steen Graham, CEO of Metrum AI. “By deploying on Solidigm D7-PS1010 SSDs and integrating DiskANN for high-speed, memory-efficient vector search, we optimized memory usage without compromising performance.”
Read on for details about our approach and key findings. You can also download the full white paper, High-Performance TCO-Optimized RAG With SSD Offloading, and for the “prove it” crowd, we’ve made the entire thing available in a GitHub repo here for you to try it yourself!
First, why does RAG matter? A simple example: Imagine you ask an AI chatbot for guidance on which documents you need to travel to a foreign country. If enough correct information was present in the model’s training data set, it will give you a helpful answer.
If not, one of two things may happen. It could tell you it doesn’t know, or, worse, it could confidently give you a wrong answer. The latter is called a hallucination, and it’s more common than you might think.
Clearly, the value of AI is tied to the quantity and quality of data available to the model.
Retrieval-augmented generation, as the name implies, retrieves additional relevant data to augment a model’s knowledge before generating a response. This is accomplished by connecting the model to sources of data that were not included in the original training set: an internal corporate database, a news feed, even Wikipedia; almost any source will do. So in our example, the user’s travel query would first be run against one or more of these sources, and the relevant information retrieved would be added to the prompt before it goes to the AI model for processing, increasing the likelihood of a good response.
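For the code-minded, here is a minimal sketch of that flow: retrieve first, then generate. It is illustrative only, not the pipeline described later in this article; it assumes the sentence-transformers package for embeddings, and the tiny document list and the call_llm placeholder are hypothetical stand-ins.

```python
# Minimal RAG sketch: embed a query, retrieve the most relevant documents,
# and add them to the prompt before calling a language model.
# The documents, embedding model, and call_llm placeholder are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model

# External knowledge the base model may never have seen (e.g., visa rules).
documents = [
    "Citizens of country A need a visa and a passport valid for six months to enter country B.",
    "Country B requires proof of onward travel for all visitors.",
    "Country C allows visa-free entry for stays under 90 days.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

def answer(query):
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # placeholder for whichever LLM you deploy

print(retrieve("What documents do I need to travel to country B?"))
```

At production scale, the in-memory document list above becomes a vector database holding millions of embeddings, which is exactly where the storage question comes in.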
An interesting debate has been heating up around whether RAG is already dead in light of new models that offer huge context windows. Meta’s Llama 4 Scout, for example, accommodates 10M tokens. The argument goes that if you can feed that much data into the prompt, you don’t need to connect to external data sources; just include all relevant info in the prompt itself.
It’s a reasonable argument on the surface, but it may be premature. A March 2025 research paper tested the recall (accuracy) of some of these newer models with big context windows and found that even if a model ostensibly supports a context window in the millions of tokens, recall suffered once more than a small fraction of it was used, about 2,000 tokens in most cases.
You can see why companies are embracing RAG-enabled inference. The problem is the same one the Woodstock organizers faced more than 50 years ago: more users are suddenly demanding much more of it.
Specifically, enterprises want:
- Bigger, more capable models, which means more model weights to store and serve
- RAG pipelines that can pull in ever-larger volumes of their own data
Neither of these is a bad goal. But both involve a heck of a lot of data, which must be stored somewhere. In today’s world, where model weights and RAG data tend to be stored in memory, that prospect gets extremely expensive, fast.
Working with Metrum AI, Solidigm has pioneered a new way forward. Our approach relies on open-source software components, carefully selected and fine-tuned to work together beautifully, to move a significant amount of data from memory to SSDs during AI inference.
There are two key components:
- RAG data offload: DiskANN builds and searches the vector index directly on the SSD, so the bulk of the vector data never has to sit in DRAM.
- Model weight offload: model weights are staged on the SSD and streamed in as needed, rather than being held entirely in GPU or system memory (see the sketch below).
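To make the second component concrete, here is a minimal, hedged sketch of one common open-source way to spill model weights onto an SSD: Hugging Face Accelerate’s disk offload, driven through the transformers device_map and offload_folder options. This illustrates the general technique, not necessarily the mechanism used in the Solidigm/Metrum AI stack; the model ID and offload path are placeholders.

```python
# Illustrative sketch of model-weight offload to SSD via Hugging Face Accelerate.
# Not necessarily the stack used in this work; model ID and paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
offload_dir = "/mnt/nvme0/offload"             # directory on the SSD

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # let Accelerate place layers across GPU, CPU, and disk
    offload_folder=offload_dir,  # layers that don't fit in memory are kept on the SSD
    torch_dtype=torch.float16,
)

prompt = "In one sentence, what does retrieval-augmented generation do?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The underlying idea is the same either way: stream weights from fast NVMe instead of pinning the entire model in DRAM or HBM, so larger models fit on memory-constrained hardware.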
The main value in offloading AI data from memory to SSDs is, naturally, that you need less memory. We used VectorDBBench, an open-source benchmarking tool for vector databases, to measure the effect across three data sets of increasing size, from 1 million to 100 million vectors.
The magnitude of the benefit scaled with database size; in other words, the more data you’re dealing with, the bigger the memory savings. On the largest data set, we observed a 191GB reduction in DRAM usage, a 57% decrease. At today’s pricing, that’s a significant cost reduction.
By moving data from memory to SSDs, we observed an increase in performance as measured in queries per second (QPS): up to 70% higher on the middle data set, and 50% higher on the largest one. In other words, not only can you do inference with less memory, you can also do it faster.
This may strike you as counterintuitive: when do you ever see a performance increase by reading from storage instead of memory? But we triple-checked the numbers. When configured with default parameters, DiskANN produces higher QPS than HNSW (the conventional in-memory approach). Indexing algorithms that do plenty of pre-processing, such as Vamana (used by DiskANN), can dramatically speed up similarity searches by packing the vectors onto the SSD efficiently (more on indexing in a bit).
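To build intuition for why on-SSD search can hold its own, here is a toy sketch of the general offload pattern: keep a small compressed copy of every vector in RAM for cheap candidate scoring, leave the full-precision vectors on the SSD (via a numpy memmap here), and touch the SSD only to re-rank a short candidate list. This is a conceptual illustration, not DiskANN itself: DiskANN stores a Vamana graph on disk so it never scans the whole set, and it uses product quantization rather than the crude 8-bit scaling shown; all names and paths are placeholders.

```python
# Toy illustration of the SSD-offload pattern used by DiskANN-style search.
# Compressed vectors stay in RAM for coarse scoring; full-precision vectors
# stay on the SSD and are read back only to re-rank a small shortlist.
# Conceptual sketch only: no Vamana graph, no real product quantization.
import numpy as np

DIM, N, PATH = 128, 100_000, "/mnt/nvme0/vectors.f32"  # placeholder path on the SSD

# One-time setup: write full-precision vectors to a file on the SSD.
full = np.memmap(PATH, dtype=np.float32, mode="w+", shape=(N, DIM))
full[:] = np.random.rand(N, DIM).astype(np.float32)
full.flush()

# In RAM: an 8-bit scaled copy (~4x smaller; DiskANN's PQ compresses far more).
scale = float(full.max())
compressed = np.round(full / scale * 127).astype(np.int8)

def search(query, k=10, shortlist=200):
    """Coarse-score every vector in RAM, then re-rank the shortlist from SSD."""
    coarse = compressed.astype(np.float32) @ query          # cheap, approximate
    cand = np.argpartition(-coarse, shortlist)[:shortlist]  # best candidates
    exact = full[cand] @ query                              # SSD reads, exact scores
    return cand[np.argsort(-exact)[:k]]

print(search(np.random.rand(DIM).astype(np.float32)))
```

In DiskANN proper, the on-disk graph keeps the number of SSD reads per query small, which is part of how an on-disk index can keep pace with, and in our default-parameter testing outpace, the in-memory HNSW baseline.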
It's worth mentioning that in Solidigm testing, HNSW performance could be improved by modifying certain parameters, but at the cost of even higher memory use.
It’s been said there’s no such thing as a free lunch, and that’s true here too. The up-front time it takes to build the RAG index is 30% to 60% higher using the offload approach.
The payoff for doing more work up front, of course, is the better performance on an ongoing basis once the stack is deployed.
For certain use cases, this could be a deal breaker. For many others, though, the benefits, in terms of memory reduction and QPS improvement, will far outweigh the increased index build time. After all, indexing is an infrequent activity relative to how often you are actually using the model to generate valuable insights.
Finally, a note on recall, a measure of how accurately the right results are retrieved. We observed no significant difference between the conventional and SSD offload approaches, with both clocking rates near 100%. In other words, offloading data did not hurt output quality.
That’s what we measured. We believe there’s significant value here for businesses that want to pull lots of data into their inference pipeline without breaking the bank. Offloading RAG data means scaling to bigger data sets at a lower cost; offloading model weights makes it easier to deploy solutions on legacy hardware or at the edge, where GPU memory constraints are more severe.
But you don’t have to take our word for it. Check out the GitHub repo for everything you need to reproduce these results yourself. You can also dig into the data and methodology in greater detail in the white paper, High-Performance TCO-Optimized RAG With SSD Offloading.
Ace Stryker is the director of market development at Solidigm, where he focuses on emerging applications for the company’s portfolio of data center storage solutions.
A version of this article was originally published by Data Center Frontier.
See High-Performance TCO-Optimized RAG With SSD Offloading White Paper for references to data and graphs reproduced in the article.