20,000,000,000,000 tokens, or 20 trillion if you prefer. That’s the estimated number of training dataset tokens in Qwen2.5-MAX, the latest LLM foundation model from Alibaba.1 If you and your descendants were (unfortunately) enlisted to serially type out every token, it would take about 600,000 years,2 so plan accordingly. This is all to say that modern AI is an information glutton, and storing and activating AI data inputs and outputs creates unprecedented storage capacity and efficiency challenges. Network Attached Storage (NAS) solutions have emerged as a critical component in managing the massive datasets required for AI model development and deployment. We will explore the pivotal role of NAS in AI, the value of high-capacity Solid-State Drives (SSDs), and where and how these drives fit in the overall AI data pipeline.
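For readers who want to check the typing estimate, the arithmetic under the assumptions in footnote 2 (60 words per minute, 5 characters per token) can be sketched in a few lines of Python. Two things here are assumptions added for illustration and are not stated in the footnote: that a typed word is also 5 characters, so one word equals one token, and that the typing never pauses. The function name `typing_years` is hypothetical.

```python
# Rough check of the typing-time estimate, using the assumptions in footnote 2.
TOKENS = 20_000_000_000_000           # 20 trillion training tokens (from the text)
WORDS_PER_MINUTE = 60                 # typing speed assumed in footnote 2
TOKENS_PER_WORD = 1                   # assumption: a 5-character word equals one 5-character token
MINUTES_PER_YEAR = 60 * 24 * 365.25   # assumption: typing around the clock, no breaks


def typing_years(tokens, wpm=WORDS_PER_MINUTE, tokens_per_word=TOKENS_PER_WORD):
    """Return the years needed to type `tokens` tokens one after another."""
    tokens_per_minute = wpm * tokens_per_word
    minutes = tokens / tokens_per_minute
    return minutes / MINUTES_PER_YEAR


if __name__ == "__main__":
    print(f"{typing_years(TOKENS):,.0f} years of continuous typing")
```

The result scales linearly with the inputs: halving the typing speed or doubling the token count doubles the answer, and allowing for sleep and weekends stretches it further.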
Ultimately, AI is the product of data and intense compute, but turning raw bits into sharp models takes work. AI data pipelines encompass several distinct but connected stages:
NAS serves as a centralized, scalable, and accessible storage architecture that supports AI processes by providing high-speed access to vast amounts of data across distributed systems. Unlike Direct-Attached Storage (DAS), which resides locally in GPU compute servers, NAS enables seamless data sharing among multiple servers, GPUs, and edge devices, making it ideal for the collaborative and iterative nature of AI workflows. NAS systems, ideally optimized with high-capacity SSDs, make potentially massive datasets readily available for processing, minimizing latency while maximizing GPU utilization.
In machine learning, data quantity and diversity are critical drivers of model performance. The more varied and voluminous the training data, the better a model generalizes to real-world scenarios. Increasing the quantity of training data can lead to significant performance improvements, as models learn from a broader range of patterns and edge cases.
However, raw data inputs are just a starting point. AI data pipelines generate and use incremental datasets throughout their operation, including:
These incremental datasets massively amplify storage requirements beyond the initial raw dataset size, making high-capacity drives a valuable ingredient in a NAS solution.
In a recent survey, global data center infrastructure provider Digital Realty asked its customers to rank their top impediments to adopting a formal AI strategy.3 The first challenge was the lack of data storage required to house massive datasets, followed closely by the lack of available power for compute, and a lack of sufficient space for data storage.
Solidigm’s answer to these challenges is the world’s highest-capacity PCIe SSD, the 122.88TB Solidigm™ D5-P5336. The drive offers unmatched capacity, power efficiency, and data density, making it a game-changer for AI-driven NAS deployments. Its key features include:
These attributes align nicely with AI data pipeline needs, particularly in the ingest and archive stages, where the value of high-capacity, durable, and efficient storage is paramount.
The D5-P5336 delivers tailored benefits to the ingest and archive stages of the AI data pipeline, addressing the unique challenges of each:
By optimizing for these stages, high-capacity SSDs enhance the overall efficiency of the AI data pipeline, enabling faster data access, reduced operational costs, and scalability for future storage growth.
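To put the capacity side of this in concrete terms, the rack configuration described in footnote 6 (36U available for storage, 2U storage servers, 24 drives of 122.88TB each) can be tallied with a short Python sketch. The function and constant names are illustrative, and the figure is raw capacity only: the footnote does not specify RAID, erasure-coding, or filesystem overhead, so usable capacity would be lower.

```python
# Raw capacity of the NAS rack described in footnote 6.
DRIVE_TB = 122.88          # capacity of one drive, in decimal terabytes
DRIVES_PER_SERVER = 24     # drives per 2U storage server (footnote 6)
SERVER_HEIGHT_U = 2        # rack units per storage server (footnote 6)
STORAGE_SPACE_U = 36       # rack units available for storage in a 42U rack (footnote 6)


def rack_raw_capacity_tb(storage_u=STORAGE_SPACE_U,
                         server_u=SERVER_HEIGHT_U,
                         drives_per_server=DRIVES_PER_SERVER,
                         drive_tb=DRIVE_TB):
    """Return raw rack capacity in TB, before any redundancy overhead."""
    servers = storage_u // server_u          # whole servers that fit in the space
    return servers * drives_per_server * drive_tb


if __name__ == "__main__":
    total_tb = rack_raw_capacity_tb()
    print(f"{STORAGE_SPACE_U // SERVER_HEIGHT_U} servers, "
          f"{total_tb:,.2f} TB raw ({total_tb / 1000:.2f} PB)")
```

Changing `drive_tb` or `drives_per_server` shows how quickly per-rack capacity moves with drive size, which is the density argument in a nutshell.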
As AI continues to push the boundaries of data storage, NAS systems equipped with high-capacity SSDs are becoming essential for managing the volume, velocity, and variety of AI datasets. The 122.88TB Solidigm D5-P5336 SSD represents a leap forward in storage technology, offering unmatched capacity, endurance, and efficiency for NAS deployments.
By addressing the data-intensive stages of the AI data pipeline, the Solidigm D5-P5336 enables organizations to harness the full potential of their data, driving model performance improvements with not only more data, but more diverse data. As AI workloads evolve, the combination of NAS and high-capacity SSDs will remain a cornerstone of scalable, efficient, high-performance data infrastructure.
For more information, visit the Solidigm D5-P5336 122.88TB SSD product brief.
Dave Sierra is a Product Marketing Analyst at Solidigm, where he focuses on solving the infrastructure efficiency challenges that face today's data centers.
1 Source: Epoch.AI, https://epoch.ai/data/notable-ai-models#Documentation
2 Assumes an average typing speed of 60 wpm at an average token size of 5 characters.
3 Source: Digital Realty, Global Data Insights Survey, August 2024.
4 Assumes 30GB per-game file size based on 4K video bitrate at 25 Mbps and H.265/HEVC compression. Based on average NFL game length of 3 hours and 272 games per season.
5 Source: Solidigm. Some results have been estimated or simulated using internal Solidigm analysis or architecture simulation or modeling and provided to you for information purposes only. Any differences in your system hardware, software or configuration may affect your actual performance.
6 Source: Solidigm. Based on a 42U NAS rack with 36U available for storage, 18x 2U storage servers, each with 24x 122.88TB SSDs.
Nothing herein is intended to create any express or implied warranty, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, or any warranty arising from course of performance, course of dealing, or usage in trade.
The products described in this document may contain design defects or errors known as “errata,” which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Solidigm does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Contact your Solidigm representative or your distributor to obtain the latest specifications before placing your product order.
SOLIDIGM and the Solidigm “S” logo are trademarks of SK hynix NAND Product Solutions Corp. (d/b/a Solidigm), registered in the United States, People’s Republic of China, Japan, Singapore, the European Union, the United Kingdom, Mexico, and other countries.