
Published February 13, 2024

Authors: Sarika Mehta, Wayne Gao, Yi Wang

The Role of Solidigm SSDs in AI Storage Advancement

As artificial intelligence rapidly advances to fuel humanity’s aspirations, computing power has had to grow as well. Clusters of thousands of GPUs are emerging everywhere, powered by high-throughput, low-latency networks and deep learning models. This evolving market prompts deep contemplation among AI architects. One of the most important questions is this: What kind of AI storage infrastructure can keep AI accelerators (GPUs, CPUs, and others) and network devices running at full capacity without idle time?

Phases of an AI project cycle

An analysis of industry practices reveals that the typical AI project cycle consists of three main phases: 

  1. Data ingestion and preparation
  2. Model development (training)
  3. Model deployment (inference) 

A fourth (optional) phase may involve iterative refinement of the model based on the actual inference results and new data. To understand AI's storage requirements, it's essential to understand the nature of primary input/output (I/O) operations in each phase and to consider them collectively to form a comprehensive view.

Phase 1: Data ingestion and preparation

Before diving into training, it's crucial to thoroughly prepare the data that will be fed into the training cluster.

1. Data discovery, extraction, and preprocessing

The raw data used for AI modeling inherits the classic “3V” characteristics of traditional big data: volume, velocity, and variety. The sources can range from event logs, transaction records, and IoT inputs to CRM, ERP, social media, satellite imagery, economic indicators, and stock transactions. It’s necessary to extract data from these diverse sources and integrate it into a temporary storage area within the data pipeline. This step is typically referred to as "extraction."

The data is then transformed into a format suitable for further analysis. In its original source systems, data is often chaotic, making it challenging to interpret. Part of the transformation objective is to enhance data quality, which typically involves:

  1. Cleaning up invalid data
  2. Removing duplicates
  3. Standardizing measurement units
  4. Organizing data based on its type

During the transformation phase, data is also structured and reformatted to align with its specific business purposes. This step is referred to as "transformation."
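
As a minimal illustration of these quality steps, the sketch below uses pandas (one of the tools named later in this article); the file and column names are assumptions made for the example only.

    import pandas as pd

    # Illustrative cleanup sketch; the file and column names are assumed.
    df = pd.read_csv("raw_events.csv")

    df = df.dropna(subset=["user_id", "timestamp"])    # drop invalid rows
    df = df.drop_duplicates()                          # remove duplicates
    df["temp_c"] = (df["temp_f"] - 32) * 5 / 9         # standardize units
    df["timestamp"] = pd.to_datetime(df["timestamp"])  # type the data properly

    df.to_parquet("clean_events.parquet")              # write processed output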

2. Data exploration and data set split

Data analysts employ visualization and statistical techniques to describe the characteristics of a dataset, such as its scale, quantity, and accuracy. Through exploration, they identify and examine relationships between different variables, the structure of the dataset, the presence of anomalies, and the distribution of values. Data exploration allows analysts to delve deeper into the raw data.

Exploration aids in identifying glaring errors, gaining a better understanding of patterns within the data, detecting outliers or unusual events, and uncovering intriguing relationships between variables. Once data exploration is complete, the dataset is typically split into training and testing subsets. These subsets are used separately during model development for training and testing purposes.
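
A minimal sketch of that split, using scikit-learn (named later in this article); the 80/20 ratio and the synthetic stand-in data are assumptions for illustration.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data: 1,000 items with 4 features each.
    features = np.random.rand(1000, 4)
    labels = np.random.randint(0, 2, size=1000)

    # Hold out 20% of the items for testing; the ratio is an assumption.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )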

3. Feature extraction, feature selection, and pattern mining

The success of an AI model hinges on whether the selected features can effectively represent the classification problem under study. 

For instance, consider an individual member of a choir: Features could include gender, height, skin color, and education level, or they could focus solely on vocal range.

Using vocal range as the feature, as opposed to the previous four dimensions, reduces the dimensionality to just one-fourth (meaning significantly less data), yet it might better encapsulate the relevant essence of the choir member. 

The process of identifying the most effective features in order to lower feature dimensionality, avoiding the perils of high dimensionality and reducing computational complexity, is known as feature selection.
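
Continuing the choir example, here is a hedged feature-selection sketch using scikit-learn; the synthetic data and the choice of scoring function are assumptions for illustration.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    # Four candidate features per member (e.g., encoded gender, height,
    # skin color, education level); labels mark the target classes.
    X = np.random.rand(500, 4)
    y = np.random.randint(0, 2, size=500)

    # Keep only the single most informative feature, mirroring the
    # reduction from four dimensions to one (e.g., vocal range).
    selector = SelectKBest(score_func=f_classif, k=1)
    X_reduced = selector.fit_transform(X, y)   # shape: (500, 1)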

Within an array of features, uncovering their intrinsic relationships and logic, such as which ones are mutually exclusive or which ones coexist, is termed pattern mining.

4. Data transformation 

For various reasons, a need might arise to transform data. It could be driven by a desire to align some data with others, facilitate compatibility, migrate portions of data to another system, establish connections with other datasets, or aggregate information within the data. 

Common aspects of data transformation include converting types, altering semantics, adjusting value ranges, changing granularity, splitting tables or datasets, and transforming rows and columns, among others.
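
The pandas sketch below illustrates a few of these transformations (type conversion, granularity change, and row/column reshaping); the frame and its columns are assumptions for the example.

    import pandas as pd

    df = pd.DataFrame({
        "day": pd.date_range("2024-01-01", periods=6, freq="D"),
        "region": ["east", "west", "east", "west", "east", "west"],
        "sales": ["10", "12", "9", "15", "11", "14"],
    })

    df["sales"] = df["sales"].astype(int)               # type conversion
    weekly = df.resample("W", on="day")["sales"].sum()  # coarser granularity
    wide = df.pivot(index="day", columns="region",
                    values="sales")                     # reshape rows/columns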

Thanks to the mature open-source project communities, we have a wealth of reliable tools at our disposal during the data ingest and preparation phase. These tools enable us to accomplish ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) tasks. Examples include:

  • Kafka 
  • Sqoop 
  • Flume 
  • Spark
  • NiFi

Additionally, for tasks like creating numerous sets of features, we can rely on tools such as:

  • Spark
  • Pandas
  • NumPy
  • Spark MLlib
  • scikit-learn
  • XGBoost

5. Storage characteristics for the data ingest and preparation phase 

During the data ingest and preparation phase, the typical workflow involves reading data randomly and writing processed items sequentially. It's imperative for the storage infrastructure to provide low latency for small random reads while simultaneously delivering high sequential write throughput.
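
As a rough, hedged illustration of these two patterns, the Python sketch below writes a staging file sequentially in large chunks and then reads it back with small random reads; the file name, sizes, and counts are arbitrary assumptions, not a tuned benchmark.

    import os, random, time

    PATH, CHUNK, READ = "staged.bin", 8 * 2**20, 4096  # 8MiB writes, 4KiB reads

    # Sequential write phase: large contiguous chunks, fsync'd at the end.
    t0 = time.perf_counter()
    with open(PATH, "wb") as f:
        for _ in range(64):
            f.write(os.urandom(CHUNK))
        os.fsync(f.fileno())
    write_mib_s = 64 * CHUNK / (time.perf_counter() - t0) / 2**20

    # Random read phase: small reads at random offsets.
    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    t0 = time.perf_counter()
    for _ in range(1000):
        os.pread(fd, READ, random.randrange(size - READ))
    read_us = (time.perf_counter() - t0) / 1000 * 1e6
    os.close(fd)

    print(f"sequential write ~{write_mib_s:.0f} MiB/s, "
          f"random read ~{read_us:.0f} us avg")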

Phase 2: Model development and training

Once the preparation of the training dataset is complete, the next phase involves model development, training, and hyperparameter tuning. The choice of algorithm is determined by the characteristics of the use case, and the model is trained using the dataset.

1. AI Frameworks 

The model's efficiency is assessed against a test dataset, the model is adjusted as needed, and it is finally deployed. The AI framework landscape is continually evolving; popular frameworks include:

  • TensorFlow
  • PyTorch
  • MXNet
  • scikit-learn
  • H2O
  • And more

This stage places extremely high demands on computational resources. Storage is crucial, as feeding data to these resources faster and more efficiently becomes a priority to eliminate resource idleness.

During model development, the dataset continuously expands, often with numerous data scientists requiring simultaneous access from different workstations. To prevent overfitting, they dynamically augment data items, generating thousands of slightly different variations.
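
A hedged sketch of that dynamic augmentation, using torchvision (an assumption; the article names PyTorch but no specific augmentation library):

    from torchvision import transforms

    # Each epoch sees a freshly randomized variant of every image, so
    # items cannot simply be cached once and reused verbatim.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
    ])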

2. Storage capacity scalability and data sharing 

Storage capacity starts to become crucial at this stage, but with the increasing number of concurrent data access operations, scalable performance is the true key to success. Data sharing between workstations and servers, along with rapid and seamless capacity expansion, are vital storage capabilities.

As training progresses, the dataset's size multiplies, often reaching several petabytes. Each training job typically involves random reads, and the entire process comprises numerous concurrent jobs accessing the same dataset. Multiple jobs competing for data access intensify the overall random I/O workload.

The transition from model development to training necessitates storage that can scale without interruption to accommodate billions of data items. It also requires rapid, multi-host random access—specifically, high random read performance. 

Training jobs often decompress input data, augment or perturb it, and randomize input order. They also need to enumerate data items, querying storage for lists of training items, a demanding operation when there are billions of them.
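
A minimal sketch of enumeration and order randomization; the directory layout is an assumption, and at billions of items the listing itself becomes a metadata-heavy storage workload.

    import os
    import random

    def list_items(root):
        # Query storage for the set of training items.
        for entry in os.scandir(root):
            if entry.is_file():
                yield entry.path

    items = list(list_items("training_data/"))
    random.shuffle(items)   # randomize input order for the next epoch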

3. Checkpointing: Burst of large sequential writes

The sheer scale of training leads to new demands. Today's training jobs may run for days or even months. Consequently, most jobs write periodic checkpoints to recover quickly from a failure, minimizing the need to restart from scratch. 

Therefore, the primary workload during training consists of random reads, occasionally punctuated by large sequential writes during checkpoints. Storage systems should be capable of sustaining the intensive random access required by concurrent training jobs, even during bursts of large sequential writes during checkpointing.
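
A minimal checkpoint sketch in PyTorch (one of the frameworks listed earlier); the model, optimizer, and path names are assumptions, and a production job would typically also record scheduler and RNG state.

    import torch

    def save_checkpoint(model, optimizer, epoch, path="checkpoint.pt"):
        # torch.save serializes the state in one pass: from the storage
        # system's perspective, a burst of large sequential writes.
        torch.save(
            {
                "epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            path,
        )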

4. Summary of the model development phase 

In summary, AI model development is a highly iterative process where successive experiments confirm or refute hypotheses. As models evolve, data scientists train them using sample datasets, often through tens of thousands of iterations. 

For each iteration, they augment data items and slightly randomize them to prevent overfitting, creating models that are accurate for the training dataset but also adaptable to live data. As training advances, datasets grow, leading to a transition from data scientists' workstations to data center servers with greater computational and storage capabilities.

Phase 3: Model deployment and inference

Once model development is complete, it's time to deploy the model and go live with the service. In this inference phase, real-world data is fed into the model, and ideally, the output provides valuable insights. Often, the model undergoes ongoing fine-tuning: new real-world data imported during the inference phase is incorporated into retraining, enhancing the model's performance.

1. Fine-tuning during real-world application

AI storage infrastructure must operate seamlessly around the clock throughout a project's lifecycle. It should possess self-healing capabilities to handle component failures and enable non-disruptive expansion and upgrades.

Data scientists' need for production data to fine-tune models and explore changing patterns and objectives underscores the importance of a unified platform—a single storage system that serves all project phases. Such a system allows development, training, and production to easily access dynamically evolving data.

2. Model readiness for production 

When a model consistently produces accurate results, it's deployed into production. The focus then shifts from refining the model to maintaining a robust IT environment. Production can take various forms, whether interactive or batch oriented. Continuous use of new data is instrumental in refining models for enhanced accuracy, and data scientists regularly update training datasets while analyzing model outputs.

The table below summarizes each phase of the AI project cycle and their respective I/O characteristics and consequent storage requirements.

AI Phase | I/O Characteristics | Storage Requirements | Impact
Data ingestion and preparation | Reading data randomly; writing preprocessed items sequentially | Low latency for small random reads; high sequential write throughput | Optimized storage means the pipeline can offer more data for training, which leads to more accurate models
Model development (training) | Random data reads | Scalability in multi-job performance and capacity; optimized random reads; high sequential write performance for checkpointing | Optimized storage improves utilization of expensive training resources (GPU, TPU, CPU)
Model deployment (inference) | Mixed random reads and writes | Self-healing capabilities to handle component failures; non-disruptive expansion and upgrades; same features as the training stage if the model undergoes ongoing fine-tuning | Business requires high availability, serviceability, and reliability

Table 1. AI project cycle with I/O characteristics and consequent storage requirements

Key storage characteristics for AI deployments

An AI project that starts with a single-chassis system during initial model development needs the flexibility to expand as data requirements grow during training and as more live data accumulates in production. To achieve higher capacity, two key strategies are employed at the infrastructure level: increasing individual disk capacity and expanding the cluster of storage enclosures.

1. Capacity 

Enhancing the capacity of individual disks and improving the horizontal scalability of storage nodes are crucial factors. At the disk level, products like the Solidigm D5-P5336 QLC SSD are now reaching capacities of up to 61.44TB. At the storage enclosure level, EDSFF (Enterprise and Datacenter Standard Form Factor) showcases unmatched storage density.

For U.2 15mm form factor drives, a typical 2U enclosure accommodates 24 to 26 disks; with 24 of the 61.44TB drives, that amounts to roughly 1.47PB of raw capacity. Upgrading to E1.L 9.5mm form factor disks, a 1U enclosure supports 32 disks, so a 2U space holds 64, for a storage density roughly 2.6x higher than a 2U U.2 enclosure. A comparison is shown in Table 2.

Form Factor | Number of 61.44TB drives in 2U rack space | Capacity per 2U rack space
Legacy U.2 15mm | 24 | 1.47PB
E1.L 9.5mm | 64 | 3.93PB

Table 2. 2U rack unit capacity based on drive form factor
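
The arithmetic behind Table 2 is straightforward (assuming the 61.44TB drives mentioned above and decimal petabytes):

    drive_tb = 61.44
    u2_drives_per_2u, e1l_drives_per_2u = 24, 64

    print(u2_drives_per_2u * drive_tb / 1000)    # ~1.47 PB for U.2 15mm
    print(e1l_drives_per_2u * drive_tb / 1000)   # ~3.93 PB for E1.L 9.5mm
    print(e1l_drives_per_2u / u2_drives_per_2u)  # ~2.67x density advantage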

It's worth noting that, for the same exposed capacity, higher storage density in a single enclosure translates to a significant reduction in the rack space occupied by storage nodes, the number of network ports required, and the power, cooling, spare parts, and manpower needed to operate them.

2. Data-sharing capability

The data-sharing capability of storage is of paramount importance, given the collaborative parallel work of multiple teams mentioned earlier and the desire to train more data before delivery. This is reflected in the high IOPS, low latency, and bandwidth of the storage network. Additionally, support for multipath is essential so that network services can continue operating in the event of network component failures.

Over time, off-the-shelf networking has consolidated around Ethernet and InfiniBand. InfiniBand offers a rich range of data rates, leading bandwidth and latency, and native RDMA support, making it a strong network for AI storage. The most popular Ethernet speeds today are 25Gbps, 40Gbps, and 100Gbps, and NVIDIA also offers products that support 200Gbps and 400Gbps with low latency and RDMA. East-west data flow among storage nodes is typically carried on a dedicated storage VLAN.

3. Adaptability to varying I/O 

AI storage performance should be consistent across all types of I/O operations. Every file and object should be accessible in approximately the same amount of time, ensuring that time-to-first-byte remains consistent whether the request is for a small 1KB item label or a large 50MB image.

4. Parallel network file operations

In AI projects, common tasks such as bulk copying, enumeration, and changing properties require efficient parallel network file operations. These operations significantly expedite AI model development. Originally developed by Sun Microsystems in 1984, NFS (Network File System) remains the most dominant network file system protocol to date. NFS over Remote Direct Memory Access (NFSoRDMA) is particularly suited for compute-intensive workloads that involve transferring large volumes of data. RDMA's data movement offload feature reduces unnecessary data copying, enhancing efficiency.

5. Summary of key AI storage characteristics

AI storage solutions should offer ample capacity, robust data sharing capabilities, consistent performance for varying I/O types, and support for parallel network file operations. These requirements ensure that AI projects can effectively manage their growing datasets and meet the performance demands of AI model development and deployment.

Conclusion

AI development continues to surpass our wildest expectations. In the urgency to feed computational behemoths with more data at a faster rate, there is no room for idle processing time and power. Solidigm offers drives in various form factors, densities, and price points to meet the needs of diverse AI deployments. High-density QLC SSDs have proven themselves in performance, capacity, reliability, and cost.

Figure 1. TLC-only solution transitions to SLC/TLC+QLC

Together with CSAL and the Solidigm D7-P5810 SLC SSD, customers can tune their deployments for performance, cost, and capacity.[1] With an innovative, complete, open-source storage stack, Solidigm SSDs have a unique advantage in bolstering AI storage advancement.

Figure 2. CSAL architecture

 

About the Authors

Sarika Mehta is a Storage Solutions Architect at Solidigm with over 15 years of storage experience. Her focus is to work closely with Solidigm customers and partners to optimize their storage solutions for cost and performance. 

Wayne Gao is a Principal Engineer and Storage Solution Architect at Solidigm. Wayne has worked on CSAL from pathfinding through its Alibaba commercial release. He has over 20 years of storage development experience as a member of the Dell EMC ECS all-flash object storage team, holds four US patent filings/grants, and is a published EuroSys paper author.

Yi Wang is a Field Application Engineer at Solidigm. Before joining Solidigm, he held technical roles with Intel, Cloudera, and NCR. He holds Cisco Certified Network Professional, Microsoft Certified Solutions Expert, and Cloudera Data Platform Administrator certifications.

Notes

[1] CSAL introduction: https://www.solidigm.com/products/technology/cloud-storage-acceleration-layer-write-shaping-csal.html

[2] https://www.solidigmtechnology.com/products/technology/solidigm-ssds-in-superpod-ai-storage-nvidia-vast-data.html

 
