Storage is the New Foundation of AI Inference

In this Six Five On The Road interview filmed at Computex 2026 in Taipei, Ryan Shrout sits down with Avi Shetty, Vice President of AI Ecosystem and Market Enablement at Solidigm, to explore how AI inference is reshaping data center storage architecture and GPU utilization.

Their central message: As enterprises shift from AI training to AI inference, storage is a critical performance lever.

Inference is fundamentally about responsiveness to the end user, and data now lives across multiple storage tiers depending on its "hotness"—whether it sits in the prefill cache, the KV cache, HBM (high-bandwidth memory), or has been evicted. Each tier directly affects token TCO (total cost of ownership), a defining metric for inference workloads.

Avi illustrates the scale with the Solidigm "Anatomy of a Prompt" study, showing how a simple eight-word LLM query ("where's the best dumpling in Taipei?") expands to roughly 42,000 tokens and 12GB to 13GB of KV cache that must be stored somewhere. As context windows grow and saturate HBM, the GPU must recompute, which lowers utilization of the GPU, the most valuable asset in any AI data center.
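To see how a figure in the 12GB to 13GB range can arise from about 42,000 tokens, here is a minimal sizing sketch. The interview does not say which model the study used, so the model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is purely an assumption chosen for illustration; only the token count comes from the interview.

```python
# Rough KV cache sizing sketch. The model shape is an ASSUMPTION for
# illustration; the interview does not state which model the study used.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Each layer keeps one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(num_tokens, **model_shape):
    """Total KV cache size for a context of num_tokens tokens."""
    return num_tokens * kv_bytes_per_token(**model_shape)


# Hypothetical model shape (not from the source).
ASSUMED_MODEL = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

TOKENS = 42_000  # token count quoted in the interview

per_token = kv_bytes_per_token(**ASSUMED_MODEL)
total = kv_cache_bytes(TOKENS, **ASSUMED_MODEL)

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB) for {TOKENS:,} tokens")
```

Under these assumed parameters each token costs 320 KiB, and the full context comes to about 13.8GB (12.8GiB), which is in the neighborhood of the figure quoted above. A different model shape or numeric precision would shift the result.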

The conversation references NVIDIA CEO Jensen Huang's CES and GTC remarks that 2026 is the "year of storage," with the context memory tier projected to consume the entire storage TAM.

Avi outlines a tiered model:

G3 (direct-attached storage) for feeding GPUs at high speed
G3.5 (context memory) to avoid recompute
G4 (shared storage) for capacity

He argues high-density QLC storage has made hard drives obsolete for the data center, showcasing the Solidigm industry-first 122TB SSD. By his account, 24 of these drives in 1U deliver four petabytes of low-power, scalable capacity.
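The capacity claim can be checked with simple arithmetic. The sketch below uses only the nominal figures quoted in the interview (122TB per drive, 24 drives, four petabytes) and assumes decimal units (1PB = 1,000TB); it makes no assumption about formatting overhead or the actual chassis.

```python
import math

# Figures quoted in the interview; decimal units are an assumption.
DRIVE_TB = 122   # nominal capacity per SSD
DRIVES = 24      # number of drives mentioned
TARGET_PB = 4    # capacity figure mentioned


def raw_capacity_pb(drives, drive_tb):
    """Raw capacity in PB, taking 1 PB = 1,000 TB."""
    return drives * drive_tb / 1000


def drives_needed(target_pb, drive_tb):
    """Smallest whole number of drives whose raw capacity reaches target_pb."""
    return math.ceil(target_pb * 1000 / drive_tb)


raw_pb = raw_capacity_pb(DRIVES, DRIVE_TB)
needed = drives_needed(TARGET_PB, DRIVE_TB)

print(f"{DRIVES} x {DRIVE_TB}TB = {raw_pb:.3f}PB raw")
print(f"Drives needed for {TARGET_PB}PB raw: {needed}")
```

On nominal numbers, 24 drives come to about 2.9PB raw, and roughly 33 drives of this size would be needed to reach 4PB. The four-petabyte figure is the speaker's; the interview does not specify the configuration behind it.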

The takeaway for enterprises evaluating RFPs and AI infrastructure: Ask how much latency and GPU recompute time you can afford, then provision scalable NVMe storage accordingly.
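One way to frame the "how much recompute time can I afford" question is to compare the time to rebuild a context on the GPU with the time to reload its KV cache from storage. The sketch below is a back-of-the-envelope model only: the prefill rate and storage bandwidth are hypothetical placeholders, and the per-token KV size of about 300KB is derived from the interview's rough figures (12GB to 13GB for 42,000 tokens). Substitute measured values from your own environment.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    context_tokens: int            # tokens of context to restore
    kv_bytes_per_token: int        # KV cache bytes per token
    prefill_tokens_per_s: float    # HYPOTHETICAL GPU prefill throughput
    storage_read_gb_per_s: float   # HYPOTHETICAL sustained read bandwidth (GB/s)


def recompute_seconds(s: Scenario) -> float:
    """Time for the GPU to rebuild the KV cache by re-running prefill."""
    return s.context_tokens / s.prefill_tokens_per_s


def reload_seconds(s: Scenario) -> float:
    """Time to read the saved KV cache back from a storage tier."""
    total_bytes = s.context_tokens * s.kv_bytes_per_token
    return total_bytes / (s.storage_read_gb_per_s * 1e9)


def cheaper_path(s: Scenario) -> str:
    """Return 'reload' if reading from storage beats recomputing, else 'recompute'."""
    return "reload" if reload_seconds(s) < recompute_seconds(s) else "recompute"


# Token count is from the interview; ~300 KB/token is 12.5 GB / 42,000 tokens.
# The throughput and bandwidth values are placeholders, not measurements.
example = Scenario(
    context_tokens=42_000,
    kv_bytes_per_token=300_000,
    prefill_tokens_per_s=10_000.0,
    storage_read_gb_per_s=10.0,
)

print(f"recompute: {recompute_seconds(example):.2f} s")
print(f"reload:    {reload_seconds(example):.2f} s")
print(f"cheaper:   {cheaper_path(example)}")
```

The model ignores queuing, network hops, and partial cache hits, so it is best read as a first-order check on whether a context memory tier is worth provisioning for a given workload, not as a capacity plan.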

Avi Shetty: The whole goal, the whole game, is to make sure that the GPUs are fully utilized. With increasing context memories, you have to ensure storage architecture is part of your data center architecture.

Ryan Shrout: Hey, everybody. Welcome to Six Five On The Road. I'm your host, Ryan Shrout, here at Computex 2026 in Taipei. And we're going to talk about how AI inference is impacting the world of storage and compute. And I'm joined by a good friend of mine, Avi Shetty. You are the Vice President of AI Ecosystem and Market Enablement at Solidigm. Welcome. Thanks for joining us.

Avi: Oh, thank you, and great to be here, and great to be here at Computex. I think there'll be a lot of cool technology announcements happening from all the ecosystem partners here, which I'm excited about.

Ryan: It's been interesting to see the evolution of Computex here in Taipei over the last handful of years as kind of the AI evolution revolution has kind of reset a whole bunch of expectations, so it's been interesting.

Avi: Yeah, you've seen how this entire conference has fundamentally changed its value prop and mission, right? I was telling your team, this is I think my 14th or 15th year coming here, but I used to come here in a client role. It used to be PC, and all the coolest PC tech. It was the hotbed for those announcements, but now it's all about AI.

Ryan: So, I want to ask you some questions about the data center world, this AI inference build out. I think one of the things that's been interesting as we've, you know, it feels like forever ago, but it's only been a couple of years, we started moving from training to inference. And I think a lot of people started to assume that maybe storage and capacity and performance were only limited to being important in that training space. In inference, we're still seeing storage as a critical value and performance part of that story. Why is that still the case?

Avi: I think you'll see a lot more. If you've not heard much so far, you'll see a lot more of storage as key to inference-related architectural discussions. And the fundamental reason is inference is all about responsiveness to the end user. And as a result, you have data which is located in different tiers, depending on the hotness and depending on whether it's in the prefill cache, whether it's in your KV cache, whether it's in HBM, whether it's evicted. All of those add to, in the end, token TCO, which is essentially the parameter that inference requires. You know, you can't fit everything in the HBM. And as a result, you need new tiers to essentially offload, to read up from caches so that you give better responsiveness back to whatever you're doing. Well, we did a study, Solidigm did a study of a simple LLM request, right? Let's talk about, "hey, we are in Taipei, where's the best dumpling available," right? It's a simple eight-word, nine-word query. We did a full study, which we kind of talk about on our website. It's called the Anatomy of a [Prompt]. And this eight-word search translates to around 42,000 tokens. And roughly 12[GB] to 13GB of KV that has to be stored somewhere. That's one such query. Now imagine you are in an engineering environment and you are asking, hey, fix this ticket on Jira, look for our history, whether we've solved it. Now the query's becoming a lot more complex, and as a result, you'll see a lot more tokens. I think all of that is part of the inference workload, and you'll see data just continuously growing.

Ryan: You mentioned KV cache, I know that's an important one, and what about context windows? Where are the other areas that the enterprises might be seeing that storage [need]?

Avi: Yeah, context windows are growing, and guess what happens when the context window grows and fully utilizes your HBM? Now your GPU has to recompute. And recompute means the GPU is not fully utilized. Your primary asset in your data center is your GPU. The whole goal, the whole game is to make sure that the GPUs are fully utilized. And with increasing context memories, you have to ensure storage architecture is part of your data center architecture. You have to ensure that you've provisioned storage at different tiers, as well as the ability to scale. That will add value to your end token output, which ensures that your GPU continuously remains utilized.

Ryan: When I think about enterprises that are getting quotes, they're looking at RFPs, they're trying to build out their infrastructure, how do you see the balance shifting between, “I used to only worry about how many GPUs I needed,” and now we're talking about with agents, “how many CPUs do I need?” Where does storage fit into that? Like, how do you recommend or kind of suggest that these enterprises look at including storage in that decision?

Avi: Fundamentally, there's a difference between training and inference. Training happens at these megawatt data centers, big foundational companies, but inference can happen anywhere, right? It can happen at your back office, can happen at your small data center, which is in your basement, or at a local colo location where you set up. So it depends upon what the individual enterprise's usage is. NVIDIA has done the blueprint, like this year, 2026, is the year of storage, right? At the start of the year, Jensen talked at CES and then followed it up at GTC about storage, this whole KV tier. His quote, which I think resonates with all our storage vendors, as well as us at Solidigm, he said, the context memory tier will use up the entire TAM of storage going forward. So that's the amount of context memory scale which you'll see with inference over the next few years. And for enterprises who are determining what's the best way to use it, I think the question they need to ask themselves is, how much latency can I afford? How much GPU recompute time can I afford? If those are very critical parameters, you need to ensure you have the right storage architecture at the three levels: G3, which is your direct attach, G3.5, which is your context memory, and then G4, which is your shared storage.

Ryan: When you look out those next two or three years, and it's hard to do in this space and really kind of have any accurate predictions, but I'm curious, what role does storage play two years from now or three years from now? Is it a capacity game? Is it a performance game? Is it just, you know, how does it change?

Avi: I think it's all of the above, right? There is no one size fits all in storage. I think, you know, storage, like the whole memory hierarchy or the memory wall, is a function of economics, right? Ideally, if you ask any GPU architect or CPU architect, "what do you want?" They'll say, "hey, give me one petabyte of persistent storage and it's non-volatile and SRAM-like latency," but that's not feasible. That's why you have tiers. You have your SRAM, your caches, you have your HBM, your DRAM, your tiered storage: G3, G3.5, and G4. And every one of those will see innovations. The focus on G3, the whole purpose of that whole section, is to ensure that your GPU is fed at high speeds and that results in GPU utilization being high. Your G3.5 is to ensure that you don't have to recompute your context every time. So there you need a function of performance as well as density. And that's a function of how big your workloads are and [what] the context is. And when it comes to shared storage, I think we're now in a world where enough data points have been shown: No more hard drives. I think it's purely a math of whether you want nine racks of hard drives, or do you want one rack with high-density QLC storage, which I've got for you here. This is our 122TB solution in one U.2 form factor. We were the first ones to introduce this back in Q4 [2025].

Ryan: That’s a lot of storage.

Avi: That's a lot of storage, but not large enough, not large enough. When you look at data center efficiencies, you put 24 of these in 1U, and now you have four petabytes, low power and scalable for the end customers.

Ryan: It's amazing to see how this revolution has kind of changed what you and I came from in the client space where we would look at drives like this all the time and kind of how they're being repurposed and where the bottlenecks are really lying. Avi, thanks for joining me. Really appreciate the conversation. Thank you, everybody, for tuning in to Six Five On The Road at Computex 2026 in Taipei. I'm Ryan Shrout. And make sure you follow us on social media and find all of our other content at sixfivemedia.com.
