Data’s Big Moment: Network Attached Storage Is Fueling the AI Surge

20,000,000,000,000 tokens, or 20 trillion if you prefer. That’s the estimated number of training dataset tokens in Qwen2.5-MAX, the latest LLM foundation model from Alibaba.1 If you and your descendants were (unfortunately) enlisted to serially type out every token, it would take about 600,000 years,2 so plan accordingly. This is all to say that modern AI is an information glutton, and storing and activating AI data inputs and outputs create unprecedented storage capacity and efficiency challenges. Network Attached Storage (NAS) solutions have emerged as a critical component in managing the massive datasets required for AI model development and deployment. We will explore the pivotal role of NAS in AI, the value of high-capacity Solid-State Drives (SSDs), and where and how these drives fit in the overall AI data pipeline.

The central role of NAS in AI data pipelines

Ultimately, AI is the product of data and intense compute, but turning raw bits into sharp models takes work. AI data pipelines encompass several distinct but connected stages: 

  • Data ingest
  • Data preparation
  • Training
  • Fine-tuning
  • Inference
  • Archive 

NAS serves as a centralized, scalable, and accessible storage architecture that supports AI processes by providing high-speed access to vast amounts of data across distributed systems. Unlike Direct-Attached Storage (DAS), which resides locally in GPU compute servers, NAS enables seamless data sharing among multiple servers, GPUs, and edge devices, making it ideal for the collaborative and iterative nature of AI workflows. NAS systems, ideally optimized with high-capacity SSDs, make potentially massive datasets readily available for processing, minimizing latency while maximizing GPU utilization.
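
To make the contrast with DAS concrete, here is a minimal Python sketch of the shared-storage pattern: several GPU servers mount the same NAS export, and each node streams its slice of a common shard list instead of keeping a private copy on local drives. The mount path, shard naming, and environment variables are illustrative assumptions, not a reference to any specific product.

# Minimal sketch: multiple training nodes reading one shared dataset from a
# NAS (NFS) export mounted at the same path on every GPU server.
# Paths and environment variables are illustrative assumptions.
import os
from pathlib import Path

DATASET_ROOT = Path("/mnt/nas/datasets/training-shards")  # hypothetical mount

def shards_for_this_node() -> list[Path]:
    """Assign a disjoint slice of the shared shard list to this node."""
    rank = int(os.environ.get("NODE_RANK", "0"))         # set by the job launcher
    world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total number of nodes
    all_shards = sorted(DATASET_ROOT.glob("*.tar"))
    # Round-robin assignment: every node sees the full namespace on the NAS,
    # but reads only its own share of the files.
    return [s for i, s in enumerate(all_shards) if i % world_size == rank]

if __name__ == "__main__":
    for shard in shards_for_this_node():
        print(f"would stream {shard} directly from the shared NAS mount")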

Visualization of the AI data pipeline storage requirements.

Data quantity: The lifeblood of machine learning

In machine learning, data quantity and diversity are critical drivers of model performance. The more varied and voluminous the training data, the better a model generalizes to real-world scenarios. Increasing the quantity of training data can lead to significant performance improvements, as models learn from a broader range of patterns and edge cases.

However, raw data inputs are just a starting point. AI data pipelines generate and use incremental datasets throughout their operation, including:

  • Transformed Data: Raw data from sources like IoT devices, social media, or medical imaging is extracted, cleaned, and reformatted during the transformation phase, creating new datasets that require storage.
  • Checkpoints and Model Artifacts: During training, models generate checkpoints to save progress, which can consume significant storage, especially for large-scale models with frequent saves.
  • Synthetic Data: Artificially generated data used to augment training datasets, test model accuracy, or address data privacy concerns.
  • Inference Outputs: Real-world data processed during the inference phase produces outputs that are often stored for analysis or retraining, further increasing storage demands.
  • Retrieval-Augmented Generation (RAG) Database Scaling: Solutions that rely on large datasets or high dimensionality for high-quality insights may require more capacity than memory can offer.
  • Key-Value (KV) Cache Overflow in Inference: Applications with large models, long queries, or multi-turn interactions may generate more KV states than local memory can hold.
  • Archived Data: Long-term storage of raw, transformed, and processed data is essential for compliance, retraining, or auditing purposes.

These incremental datasets massively amplify storage requirements beyond the initial raw dataset size, making high-capacity drives a valuable ingredient in a NAS solution.
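
As a rough illustration of how these categories stack up, the Python sketch below tallies a hypothetical pipeline's footprint. Every input value (raw dataset size, model size, checkpoint cadence, embedding dimensions, KV-cache shape, session counts) is an assumed placeholder chosen to show the mechanics of the estimate, not a measured workload.

# Back-of-envelope storage-amplification estimate across the categories above.
# All input values are illustrative assumptions, not measurements.
TB = 1e12  # bytes per terabyte (decimal)

raw_tb          = 500.0                  # ingested raw data
transformed_tb  = raw_tb * 0.6           # cleaned / reformatted copies
checkpoint_tb   = (70e9 * 16 / TB) * 50  # 70B params, ~16 B/param incl. optimizer state, 50 saves
synthetic_tb    = raw_tb * 0.3           # augmentation / privacy-safe data
inference_tb    = 100.0                  # logged outputs retained for retraining

# RAG vector store: chunks * embedding dimensions * bytes per element (fp32)
rag_tb = 2e9 * 1024 * 4 / TB

# KV-cache spill for long multi-turn sessions:
# 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes per value
kv_per_session = 2 * 80 * 8 * 128 * 128_000 * 2   # ~42 GB per 128k-token session
kv_tb = kv_per_session * 1_000 / TB               # 1,000 retained sessions

archive_tb = raw_tb + transformed_tb + inference_tb  # long-term copies for compliance

total_tb = (raw_tb + transformed_tb + checkpoint_tb + synthetic_tb
            + inference_tb + rag_tb + kv_tb + archive_tb)
print(f"raw: {raw_tb:.0f} TB, total: {total_tb:.0f} TB "
      f"({total_tb / raw_tb:.1f}x amplification)")

Under these stand-in numbers, the pipeline's total footprint works out to roughly four times the raw dataset, which is the amplification effect described above.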

What do users value in a NAS solution? Capacity, power, and space

In a recent survey, global data center infrastructure provider Digital Realty asked its customers to rank their top impediments to adopting a formal AI strategy.3 The top challenge was a lack of the data storage required to house massive datasets, followed closely by a lack of available power for compute and a lack of sufficient space for data storage.

Solidigm D5-P5336 SSD in capacities up to 122TB

Solidigm’s answer to these challenges is the world’s highest-capacity PCIe SSD, the 122.88TB Solidigm™ D5-P5336. This drive offers unmatched capacity, power efficiency, and data density, making it a game-changer for AI-driven NAS deployments. Its key features include:

  • The World’s Highest Capacity PCIe SSD: The 122TB D5-P5336 can store massive datasets, such as every single NFL game in 4K from the last 15 seasons4 (see the quick check after this list).
  • Power and Space Efficiency: The D5-P5336 consumes up to 90% less power5 in NAS deployments compared to legacy hybrid HDD + TLC SSD solutions and delivers 2.5x more terabytes per watt than 30TB TLC SSDs.
  • Compact Footprint: The D5-P5336 achieves up to a 9:1 NAS footprint reduction,6 enabling data centers to scale storage while minimizing physical infrastructure.
  • Unlimited Random Write Endurance: Round-the-clock 32KB or 4KB random writes will not wear out this drive over the span of its five-year warranty.
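
As a quick sanity check of the NFL example, the short Python snippet below reproduces the arithmetic using the assumptions stated in note 4 (roughly 30GB per game, 272 games per season, 15 seasons):

# Quick check of the NFL-game capacity example using the assumptions in note 4.
games_per_season = 272
seasons = 15
gb_per_game = 30   # ~3-hour game at a 25 Mbps 4K bitrate with H.265/HEVC

total_tb = games_per_season * seasons * gb_per_game / 1000
print(f"{games_per_season * seasons} games ≈ {total_tb:.1f} TB")  # ≈ 122.4 TB, fits on one 122.88TB drive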

These attributes align nicely with AI data pipeline needs, particularly in the ingest and archive stages, where the value of high-capacity, durable, and efficient storage is paramount.

Mapping high-capacity SSD value to AI data pipeline stages

The D5-P5336 delivers tailored benefits to the ingest and archive stages of the AI data pipeline, addressing the unique challenges of each:

  • Ingest: Here, raw data from diverse sources such as event logs, CRM systems, or LIDAR sensors is written to storage at high velocity. This phase generates heavy sequential write activity, which the D5-P5336 handles efficiently and durably with its unlimited write endurance. The drive’s massive capacity allows NAS systems to accommodate the “3V” characteristics of big data (volume, velocity, and variety) while remaining scalable for future data growth.
  • Archive: This stage involves storing raw, transformed, and processed data for the long haul, with data privacy regulations factoring in as well. At 122.88TB per drive, D5-P5336-based NAS systems can deliver a mind-blowing 53 petabytes of raw capacity in a single 42U NAS rack,6 massively reducing the physical and power footprint of archival storage (a quick capacity calculation follows this list). Using 90% less power for storage versus hybrid solutions5 translates into significant operating efficiency advantages when storing data for retraining or compliance.
  • Prep & Inference: NAS works in conjunction with DAS in the data prep and inference stages to provide centralized, scalable storage, enabling efficient data access, management, and transfer for preprocessing, training, and real-time inference tasks.

General view of the AI data pipeline: data ingest, data prep, AI training, AI inference, and data archive.
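
For reference, the 53-petabyte figure in the archive bullet follows directly from the configuration described in note 6; the short Python calculation below reproduces it (the server and drive counts are taken from that note, not from any particular NAS vendor's chassis):

# Raw capacity of one 42U NAS rack per the configuration in note 6.
servers_per_rack = 18      # 2U storage servers filling 36U of usable space
drives_per_server = 24
drive_tb = 122.88

rack_tb = servers_per_rack * drives_per_server * drive_tb
print(f"{rack_tb:,.0f} TB ≈ {rack_tb / 1000:.0f} PB of raw capacity per rack")  # ~53 PB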

By optimizing for these stages, high-capacity SSDs enhance the overall efficiency of the AI data pipeline, enabling faster data access, reduced operational costs, and scalability for future storage growth.

Conclusion

As AI continues to push the boundaries of data storage, NAS systems equipped with high-capacity SSDs are becoming essential for managing the volume, velocity, and variety of AI datasets. The 122.88TB Solidigm D5-P5336 SSD represents a leap forward in storage technology, offering unmatched capacity, endurance, and efficiency for NAS deployments.

The Solidigm SSD portfolio advantage across the AI data pipeline.

By addressing the data-intensive stages of the AI data pipeline, the Solidigm D5-P5336 enables organizations to harness the full potential of their data, driving model performance improvements with not only more data, but more diverse data. As AI workloads evolve, the combination of NAS and high-capacity SSDs will remain a cornerstone of scalable, efficient, high-performance data infrastructure.

For more information, visit the Solidigm D5-P5336 122.88TB SSD product brief. 


About the Author

Dave Sierra is a Product Marketing Analyst at Solidigm, where he focuses on solving the infrastructure efficiency challenges that face today's data centers.

Notes

1 Source: Epoch.AI, https://epoch.ai/data/notable-ai-models#Documentation

2 Assumes an average typing speed of 60 wpm at an average token size of 5 characters

3 Source: Digital Realty, Global Data Insights Survey, August 2024

4 Assumes a 30GB per-game file size based on a 4K video bitrate of 25 Mbps with H.265/HEVC compression, an average NFL game length of 3 hours, and 272 games per season.

5 Source: Solidigm. Some results have been estimated or simulated using internal Solidigm analysis or architecture simulation or modeling and provided to you for information purposes only. Any differences in your system hardware, software or configuration may affect your actual performance.

6 Source: Solidigm. Based on a 42U NAS rack with 36U available for storage, 18x 2U storage servers, each with 24x 122.88TB SSDs.

Disclaimers

Nothing herein is intended to create any express or implied warranty, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, or any warranty arising from course of performance, course of dealing, or usage in trade.

The products described in this document may contain design defects or errors known as “errata,” which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Solidigm does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Contact your Solidigm representative or your distributor to obtain the latest specifications before placing your product order. 

SOLIDIGM and the Solidigm “S” logo are trademarks of SK hynix NAND Product Solutions Corp. (d/b/a Solidigm), registered in the United States, People’s Republic of China, Japan, Singapore, the European Union, the United Kingdom, Mexico, and other countries.