Optimizing AI Workloads With Solidigm™ Solid-State Drives

Performance insights from MLPerf inference, training, and storage

Abstract

The evolution of artificial intelligence (AI) workloads has amplified the demand for efficient storage and compute solutions to optimize performance across training and inference tasks. This study leverages MLPerf benchmarks—Inference v4.1, Training v4.1, and Storage v1.0—to evaluate the impact of Solidigm SSDs, specifically the D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA), on AI efficiency. Results reveal that inference performance remains largely unaffected by disk configurations, as it depends primarily on GPU capabilities and memory bandwidth, with no significant gains from additional SSDs. In contrast, training workloads, particularly data-intensive models like DLRMv2, exhibit substantial performance improvements with high-speed NVMe SSDs, where the D7-PS1010 outperforms the D5-P5336 in configurations with fewer disks, although gains plateau with scaling. The MLPerf Storage benchmark further confirms NVMe’s superiority over SATA, with D7-PS1010 achieving peak throughput with fewer disks compared to D5-P5336, while D3-S4520 proves inadequate for modern AI demands. These findings underscore the need for tailored storage strategies, with high-performance NVMe for training and a focus on compute optimization for inference, highlighting the critical role of infrastructure balance in maximizing AI system efficiency.

Introduction

The increasing complexity of artificial intelligence (AI) workloads has placed unprecedented demands on system performance, necessitating a nuanced understanding of how storage and compute components influence efficiency. The MLPerf benchmarking suites—Inference, Training, and Storage—provide a standardized framework to assess AI system performance across diverse hardware configurations, offering critical insights into optimizing these workloads. 

MLPerf Inference evaluates real-time prediction tasks, where efficiency hinges on model execution speed, typically within memory, rendering disk performance a secondary factor. Conversely, MLPerf Training examines the process of building AI models from scratch, a phase heavily reliant on storage throughput due to extensive data access requirements, especially for tasks like recommendation systems and image processing. Complementing these, the MLPerf Storage benchmark isolates storage performance under AI-specific data pipelines, addressing the growing need for scalable, high-throughput solutions in data-intensive applications.

This study investigates the interplay between storage configurations and AI performance using Solidigm SSDs—the NVMe D7-PS1010 and D5-P5336 and the SATA D3-S4520—across two server platforms, the QuantaGrid D74H-7U and D54U-3U. Findings indicate that inference workloads are compute- and memory-bound, showing negligible benefits from enhanced storage setups, while training and storage benchmarks reveal significant advantages with NVMe SSDs, particularly for models like DLRMv2 that demand rapid data retrieval. By analyzing these outcomes, this study underscores the pivotal role of high-performance storage in training scenarios and the primacy of GPU and memory optimization in inference, providing actionable guidance for designing efficient AI infrastructures. These insights aim to inform stakeholders in academia and industry on aligning hardware choices with workload-specific needs to achieve scalability and sustained performance.

Benchmarking Setup and Methodology

[Product images: Solidigm D7-PS1010 U.2 high-performance SSD and Solidigm D5-P5336 U.2 high-density SSD]

Hardware and Software Configuration

To evaluate the impact of storage configurations on AI workloads, MLPerf inference, training, and storage benchmarks were conducted on two server platforms. 

| System | QuantaGrid D74H-7U | QuantaGrid D54U-3U |
|---|---|---|
| CPU | Intel Xeon Platinum 8480+ (56 cores x 2) | Intel Xeon Platinum 8470 (52 cores x 2) |
| RAM | 2 TB (DDR5-4800 64 GB x 32) | 2 TB (DDR5-4800 64 GB x 32) |
| OS Disk | Samsung PM9A3 3.84 TB x 1 | Samsung PM9A3 1.92 TB x 1 |
| Data Disk | Solidigm D7-PS1010 U.2 7.68 TB x 8; Solidigm D5-P5336 U.2 15.36 TB x 8 | Solidigm D7-PS1010 U.2 7.68 TB x 8; Solidigm D5-P5336 U.2 15.36 TB x 8; Solidigm D3-S4520 SATA 7.68 TB x 8 |
| Accelerator | H100 SXM5 80 GB x 8 | H100 PCIe 80 GB x 4 |
| BIOS Configuration | Profile: Performance; Enable LP [Global]: ALL LPs; SNC: Disable | Profile: Performance; Enable LP [Global]: ALL LPs; SNC: Disable |
| OS | Rocky Linux release 9.2 (Blue Onyx) | Rocky Linux release 9.2 (Blue Onyx) |
| Kernel | 5.14.0-362.18.1.el9_3.x86_64 | 5.14.0-284.11.1.el9_2.x86_64 |
| Framework | GPU Driver 550.127.08; CUDA 12.4 + GDS 12.4 | GPU Driver 550.90.07; CUDA 12.4 |

Table 1. Configuration of the QuantaGrid D74H-7U and D54U-3U with different Solidigm SSD solutions

|  | Solidigm™ D7-PS1010 | Solidigm™ D5-P5336 | Solidigm™ D3-S4520 |
|---|---|---|---|
| Form Factor | U.2 | U.2 | 2.5 in. |
| Capacity | 7.68 TB | 15.36 TB | 7.68 TB |
| Lithography Type | 176L TLC 3D NAND | 192L QLC 3D NAND | 144L TLC 3D NAND |
| Interface | PCIe 5.0 x4, NVMe | PCIe 4.0 x4, NVMe | SATA 3.0 6 Gb/s |
| Sequential Read (up to) | 14,500 MB/s | 7,000 MB/s | 550 MB/s |
| Sequential Write (up to) | 9,300 MB/s | 3,000 MB/s | 510 MB/s |
| Random Read (up to) | 2,800,000 IOPS (4K) | 1,005,000 IOPS (4K) | 86,000 IOPS (4K) |
| Random Write (up to) | 400,000 IOPS (4K) | 24,000 IOPS (4K) | 30,000 IOPS (4K) |

Table 2. Specifications of the Solidigm D7-PS1010, D5-P5336, and D3-S4520

MLPerf inference testing was performed on both platforms, with each system evaluated in "Server" and "Offline" modes to simulate real-world AI inference environments. The benchmarks assessed performance across different storage configurations using 1, 2, 4, and 8 drives to analyze scalability and throughput efficiency. Given that inference workloads primarily rely on memory and GPU performance, the objective was to determine whether increased disk count had any measurable impact on performance.

For MLPerf training and storage benchmarks, both QuantaGrid D74H-7U and D54U-3U were utilized for training workloads, while D54U-3U was also used for storage performance evaluation. Training tests examined the relationship between storage configurations and AI model performance. Storage benchmarks analyzed disk throughput and efficiency under AI-specific workloads to assess the benefits of NVMe SSDs over SATA alternatives.

In configurations utilizing 2, 4, and 8 drives, a software RAID0 setup was implemented to optimize read and write speeds, ensuring efficient data distribution across SSDs. To fully leverage SSD performance, all NVMe SSDs were directly connected via the CPU’s PCIe lanes or through a PCIe switch. Hardware RAID was avoided to prevent potential bandwidth limitations imposed by RAID controllers, ensuring that AI workloads could maximize storage throughput without PCIe lane constraints.
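As a concrete reference, the following sketch shows how such a striped volume could be assembled before a benchmark run. It is a minimal illustration only: the device names, the XFS filesystem, and the mount point are assumptions for the example, not a record of the exact provisioning scripts used in this study.

```python
# Minimal sketch of a software RAID0 data volume for benchmark runs.
# Device names, filesystem, and mount point are hypothetical; requires root.
import subprocess

def run(cmd):
    """Run a shell command and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def build_raid0(devices, md_dev="/dev/md0", mountpoint="/mnt/mlperf_data"):
    # Stripe the NVMe namespaces into a single md device (RAID0).
    run(["mdadm", "--create", md_dev, "--level=0",
         f"--raid-devices={len(devices)}", *devices])
    # Format and mount the array for the benchmark data directory.
    run(["mkfs.xfs", "-f", md_dev])
    run(["mkdir", "-p", mountpoint])
    run(["mount", md_dev, mountpoint])

if __name__ == "__main__":
    # Example: a 4-drive RAID0 across directly attached NVMe namespaces.
    build_raid0([f"/dev/nvme{i}n1" for i in range(4)])
```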

MLPerf Workloads

This section outlines the MLPerf benchmarking suites—Inference v4.1, Training v4.1, and Storage v1.0—developed by the MLCommons Association to evaluate AI system performance across inference, training, and storage workloads. These suites provide standardized, reproducible methodologies to assess hardware and software efficiency, ensuring fairness, comparability, and flexibility through distinct divisions and rules.

MLPerf Inference v4.1

MLPerf Inference v4.1 is designed to measure the performance of AI systems during real-time inference tasks, focusing on execution speed, latency, and accuracy. It evaluates a diverse set of workloads, including BERT [1], ResNet-50 [2], RetinaNet [3], 3D-Unet [4], DLRMv2 [5], GPT-J [6], Llama2-70B [7], Mixtral-8x7B [8], and Stable Diffusion XL (SDXL) [9], using standardized system configurations and frameworks. The suite quantifies key metrics such as latency, throughput, and efficiency while ensuring that model accuracy meets predefined standards, supporting platforms ranging from low-power edge devices to high-performance data center servers. It promotes openness and comparability across vision, language, commerce, generative, and graph domains, addressing diverse deployment contexts.

Key Definitions

In MLPerf Inference, critical terms include:

  • Sample: The unit of inference, such as an image, sentence, or node ID (e.g., one image for ResNet-50, one sequence for BERT).
  • Query: A set of N samples issued together to the system under test (SUT), where N is a positive integer (e.g., 8 images per query).
  • Quality: The model’s ability to produce accurate outputs.
  • System Under Test (SUT): A defined set of hardware (e.g., processors, accelerators, memory) and software resources measured for performance.
  • Reference Implementation: The canonical implementation provided by MLPerf, to which all valid Closed submissions must be equivalent.

Testing Scenarios

MLPerf Inference incorporates four distinct testing scenarios to mirror real-world inference workloads, as detailed in the following table:

| Scenario | Purpose | Use Case | Metric |
|---|---|---|---|
| Single Stream | Assesses latency for a single query stream | Real-time applications like voice recognition or live video analysis | Time to process each query |
| Multi-Stream | Tests performance across multiple simultaneous streams | Multi-user systems, such as video streaming or chatbots | Latency and throughput with concurrent queries |
| Server | Evaluates handling of dynamic, online query loads | Cloud inference services with fluctuating demand | Queries per second (QPS) under latency constraints |
| Offline | Measures throughput for large batch processing | Bulk tasks like dataset analysis or media indexing | Total queries processed in a set timeframe |

Submission Divisions

MLPerf Inference features two divisions: Closed and Open. The Closed Division mandates equivalence to the reference or alternative implementation, allowing calibration for quantization but prohibiting retraining. The Open Division permits arbitrary pre-/post-processing and models, including retraining, with reported accuracy and latency constraints, fostering innovation but sacrificing comparability.

Methodology and Workflow

The MLPerf Inference process relies on the Load Generator (LoadGen), a C++ tool with Python bindings, to generate queries, track latency, validate accuracy, and compute metrics. LoadGen runs on the host processor, simulates queries from logical data sources, and stores traces in DRAM, subject to the benchmark's bandwidth requirements. Figure 1 illustrates the simplified workflow of MLPerf Inference from configuration to the validation stage, incorporating a failure scenario where validation does not pass, requiring reconfiguration of the System Under Test (SUT).

Figure 1. Simplified MLPerf inference workflow to validation stage

These scenarios ensure comprehensive coverage of latency-critical and throughput-focused applications, with early stopping criteria allowing shorter runs while maintaining statistical validity.
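For readers unfamiliar with LoadGen, the sketch below shows how a System Under Test and Query Sample Library are wired to it through the mlperf_loadgen Python bindings. It is a minimal illustration with a stub model rather than the harness used for these submissions, and binding signatures can differ slightly between LoadGen releases.

```python
# Illustrative LoadGen harness using the mlperf_loadgen Python bindings.
# The "model" is a stub that returns a one-byte response per sample.
import array
import mlperf_loadgen as lg

TOTAL_SAMPLES = 1024        # samples available in the dataset
PERFORMANCE_SAMPLES = 256   # samples LoadGen may keep resident in memory

def load_samples(indices):
    pass                    # a real SUT would stage these samples in DRAM

def unload_samples(indices):
    pass

def issue_query(query_samples):
    # Run "inference" for each sample and report completion to LoadGen.
    responses = []
    for qs in query_samples:
        result = array.array("B", [0])              # dummy output buffer
        addr, length = result.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, length * result.itemsize))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline         # or Server, SingleStream, ...
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_query, flush_queries)
qsl = lg.ConstructQSL(TOTAL_SAMPLES, PERFORMANCE_SAMPLES,
                      load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)                    # writes the mlperf_log_* output files
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```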

Rules and Guidelines

Rules ensure fairness, requiring consistent systems and frameworks, open-sourcing code, restricting non-determinism to fixed seeds, and prohibiting benchmark detection or input-based optimization. Replicability is mandatory, with audits validating compliance, particularly for Closed submissions.

Use Cases and Impact

MLPerf Inference supports edge computing, cloud infrastructure, and specialized domains, optimizing real-time inference and scalability, driving efficient AI solution development.

MLPerf Training v4.1

MLPerf Training v4.1 establishes standardized benchmarks to measure training performance, defined as execution speed, across diverse ML tasks. It evaluates workloads such as BERT, DLRMv2, GNN (R-GAT) [10], Low-Rank Adaptation (LoRA) [11], Stable Diffusion (SD) [12], and Single-Shot Detector (SSD) [13], ensuring fairness through defined rules. Performance and quality are the key metrics, with results eligible for the MLPerf trademark upon compliance. The suite encompasses systems, frameworks, benchmarks, and runs, normalized against reference results.

Key Definitions

Key terms include:

  • Performance: Execution speed of training.
  • Quality: Model accuracy in generating correct outputs.
  • System: Hardware and software influencing runtime, excluding ML frameworks.
  • Framework: A specific ML library version.
  • Benchmark: An abstract ML problem solved by training to a quality target.
  • Run: Complete training from initialization to quality target, measured in wall-clock time.
  • Reference Implementation: MLPerf-provided implementation defining benchmark standards.

Benchmarks and Divisions

The suite spans vision, language, commerce, and graph domains, with Closed and Open divisions. Closed mandates reference preprocessing, models, and targets, ensuring comparability, while Open allows flexibility in data and methods, requiring iterative improvement and benchmark dataset alignment.

Methodology and Workflow

Training adheres to the reference models, weights, optimizers, and hyperparameters, with restrictions on random number generation (Closed division: stock generators, clock-seeded and logged via mllog) and numerical formats (Closed division: pre-approved formats such as fp32 and fp16). Data handling must remain consistent with the reference, quality is evaluated at specified frequencies, and results are derived from multiple runs and normalized against the reference. Figure 2 illustrates the simplified workflow of MLPerf Training from system definition to the convergence check stage, incorporating a failure scenario where convergence does not meet Reference Convergence Points (RCPs), requiring adjustments to hyperparameters or re-running the process.

Figure 2. Simplified MLPerf training workflow to convergence check
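As an illustration of the clock-seeded random number generation noted above, the sketch below records the seed through the mlperf_logging package's mllog interface so a compliance checker can audit it. The constant names and log path are assumptions based on that package; the MLPerf reference implementations remain the authoritative examples.

```python
# Sketch of Closed-division seed handling: derive the seed from the clock
# and record it via mllog. Constant and file names are illustrative.
import time
from mlperf_logging import mllog

mllog.config(filename="training_compliance.log")   # log consumed by the checker
mllogger = mllog.get_mllogger()

seed = int(time.time())                             # clock-derived, not hard-coded
mllogger.event(key=mllog.constants.SEED, value=seed)

# The framework RNGs would then be seeded from this single logged value,
# e.g. torch.manual_seed(seed) and numpy.random.seed(seed % 2**32).
```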

Rules and Guidelines

Fairness is paramount, prohibiting benchmark detection, pre-training (except metadata), and requiring replicability. Reference Convergence Points (RCPs) ensure submission convergence aligns with reference, with audits and hyperparameter borrowing allowed to optimize performance.

Use Cases and Impact

MLPerf Training supports AI model development in vision, language, and commerce, optimizing training for data centers, enhancing scalability, and driving hardware/software innovation.

MLPerf Storage v1.0

MLPerf Storage v1.0 evaluates storage system performance for ML workloads, emulating accelerator demands via sleep intervals to isolate data-ingestion pipelines, enabling scalable testing without compute clusters. It focuses on storage scalability and performance, supporting workloads like 3D U-Net, ResNet-50, and CosmoFlow.

Key Definitions

Key terms include:

  • Sample: The unit of data for training, e.g., an image or sentence (140 MB/sample for 3D U-Net).
  • Step: A single batch of samples loaded into and processed by the emulated accelerator.
  • Accelerator Utilization (AU): Percentage of time emulated accelerators are active relative to total runtime (e.g., ≥90% for 3D U-Net).
  • Division: Rules for comparable results (Closed, Open).
  • DLIO: Core benchmarking tool emulating I/O patterns. DLIO (Deep Learning I/O) [14] is an open-source benchmarking suite from Argonne National Laboratory, originally designed for HPC systems such as the Theta supercomputer. By profiling and modeling the I/O behavior of scientific deep learning workloads, DLIO accurately reproduces realistic data-ingestion patterns at scale. This allows users to stress-test storage infrastructure under conditions typical of large, distributed ML training without primarily measuring raw compute power.
  • Dataset Content: Data and capacity, not format (e.g., KiTS19 for 3D U-Net).
  • Dataset Format: Storage format (e.g., npz).
  • Storage System: Hardware/software providing storage services to host nodes.
  • Storage Scaling Unit: Minimum unit increasing storage performance/scale (e.g., nodes, controllers).
  • Host Node: Minimum unit increasing load, running simulators identically.

Benchmarks and Divisions

The suite simulates I/O patterns from MLPerf Training/HPC, measuring samples/second with minimum AU thresholds (e.g., 90% for ResNet-50). Closed standardizes parameters for comparability, restricting modifications, while Open allows customizations (e.g., DLIO changes) for innovation, requiring documentation.

Methodology and Workflow

MLPerf Storage uses DLIO to generate synthetic datasets (≥5x host DRAM, avoiding caching), calculating minimum sizes based on accelerators, memory, and steps. Single-host or distributed setups scale load, synchronizing via barriers, measuring performance as samples/second. Figure 3 illustrates the simplified workflow of MLPerf Storage from configuration to the validation of Accelerator Utilization (AU) threshold stage, incorporating a failure scenario where the AU threshold is not met, requiring adjustments to the storage system or configuration.

Figure 3. Simplified MLPerf storage workflow to AU threshold validation
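To make the sizing and pass/fail logic concrete, the sketch below computes a minimum synthetic dataset size from host DRAM and checks an accelerator-utilization threshold. The inputs are hypothetical placeholders; in practice DLIO and the benchmark's wrapper scripts perform these calculations from the workload configuration.

```python
# Illustrative MLPerf Storage sizing and AU check (hypothetical inputs only).
import math

def min_dataset_bytes(host_dram_bytes, multiplier=5):
    # The dataset must be large enough (>= 5x host DRAM) to defeat page caching.
    return multiplier * host_dram_bytes

def num_files_needed(dataset_bytes, bytes_per_sample, samples_per_file):
    return math.ceil(dataset_bytes / (bytes_per_sample * samples_per_file))

def au_passes(accelerator_busy_s, total_runtime_s, threshold=0.90):
    # Accelerator Utilization: fraction of wall time the emulated accelerators
    # spend "computing" (sleeping) rather than waiting on storage I/O.
    return accelerator_busy_s / total_runtime_s >= threshold

# Hypothetical host with 512 GiB DRAM and 3D U-Net-like 140 MB samples.
dram = 512 * 2**30
dataset = min_dataset_bytes(dram)
files = num_files_needed(dataset, bytes_per_sample=140e6, samples_per_file=1)
print(f"dataset >= {dataset / 2**40:.1f} TiB across >= {files} files")
print("AU check passes:", au_passes(accelerator_busy_s=930, total_runtime_s=1000))
```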

Rules and Guidelines

Rules ensure fairness, requiring available systems (commercially available within 6 months), fixed seeds, stable storage, no data preloading, cleared caches, and run-to-run variation within 5% across five runs. Audits verify compliance, with Closed submissions using the provided scripts and Open submissions allowed to modify DLIO.

Use Cases and Impact

MLPerf Storage optimizes storage for ML training, supporting large-scale data pipelines in vision, scientific computing, and more, guiding infrastructure planning for scalability and efficiency.

Results and Analysis

MLPerf Inference v4.1 Disk Configuration Performance Analysis

The performance evaluation of MLPerf Inference v4.1 using Solidigm D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA) SSDs across different RAID0 disk configurations indicates that increasing the number of disks has negligible impact on inference performance. Speed-up values remain nearly constant across all tested models, including ResNet50, RetinaNet, BERT, DLRMv2, 3D-Unet, SDXL, GPT-J, Llama2-70B, and Mixtral.

Figures 4 through 8 show that, for the D7-PS1010, D5-P5336, and D3-S4520, inference speed-up remains unchanged across different disk configurations. This suggests that MLPerf inference workloads are largely compute- and memory-bound rather than I/O-bound. Since inference primarily involves model execution in memory with minimal disk access, adding more storage devices does not provide measurable performance gains.

Additionally, across the D74H-7U and D54U-3U platforms, the trend remains consistent, with no significant variation in speed-up observed between different SSD models or disk configurations. This further reinforces that MLPerf inference does not rely on disk I/O for performance improvements, and disk choice plays a minor role in overall system efficiency.

A notable case is Mixtral, a newly added model in MLPerf Inference v4.1, which has been optimized and quantized by NVIDIA for high-performance GPUs such as the H100 and H200 SXM5. However, the optimized Mixtral workload does not fully support the H100 PCIe 80GB in the D54U-3U platform, so it was omitted from tests on that system.

These findings emphasize that for AI inference tasks, investing in additional high-speed SSDs may not yield significant benefits, and efforts should instead focus on optimizing compute acceleration and memory efficiency.

Figure 4. MLPerf inference with Solidigm D7-PS1010 on D74H-7U
Figure 5. MLPerf inference with Solidigm D5-P5336 on D74H-7U
Figure 6. MLPerf inference with Solidigm D7-PS1010 on D54U-3U
Figure 7. MLPerf inference with Solidigm D5-P5336 on D54U-3U
Figure 8. MLPerf inference with Solidigm D3-S4520 on D54U-3U

MLPerf Training v4.1 Disk Configuration Performance Analysis

Figure 9 and Figure 10 provide a comparative analysis of MLPerf training speed-up across varying storage configurations, using the D7-PS1010 and D5-P5336 on the D74H-7U. These figures highlight the scalability characteristics of multiple machine learning models, including BERT, DLRMv2, GNN, LoRA, Stable Diffusion (SD), and Single-Shot Detector (SSD), under different disk configurations (1, 2, 4, and 8 disks).

Figure 9. MLPerf Training with Solidigm D7-PS1010 on D74H-7U

In Figure 9, DLRMv2 and GNN show the most noticeable improvements as the number of disks increases. DLRMv2 achieves a peak speed-up of 1.29x with 8 disks, while GNN reaches 1.10x. Other models display only marginal changes, suggesting limited disk I/O dependency.

Figure 10. MLPerf Training with Solidigm D5-P5336 on D74H-7U

Table 3 presents the relative standard deviation (RSD) of training time for each model under different storage configurations. The RSD values reveal noticeable run-to-run variability for every model, suggesting that training performance is influenced by factors beyond disk I/O. This variability is further compounded by random seed selection, which affects training convergence and computational efficiency across repeated runs.

| AI Model | D7-PS1010 x8 | D7-PS1010 x4 | D7-PS1010 x2 | D7-PS1010 x1 | D5-P5336 x8 | D5-P5336 x4 | D5-P5336 x2 | D5-P5336 x1 |
|---|---|---|---|---|---|---|---|---|
| BERT | 7.65% | 9.46% | 8.03% | 8.95% | 5.70% | 9.91% | 72.50% | 5.90% |
| DLRMv2 | 5.13% | 7.32% | 4.32% | 6.91% | 5.46% | 5.38% | 3.02% | 3.71% |
| GNN | 4.50% | 3.98% | 5.26% | 3.69% | 4.20% | 6.77% | 34.50% | 4.14% |
| LoRA | 6.17% | 4.27% | 8.17% | 8.58% | 6.55% | 6.19% | 5.48% | 6.33% |
| SD | 13.86% | 11.21% | 11.75% | 15.39% | 11.18% | 11.93% | 13.54% | 11.40% |
| SSD | 0.07% | 10.65% | 0.12% | 0.22% | 0.17% | 0.16% | 0.08% | 0.04% |

Table 3. Relative standard deviation for each AI model workload in MLPerf training
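For reference, the relative standard deviation reported in Table 3 is the standard deviation of the measured training times across repeated runs, expressed as a percentage of their mean; the short snippet below illustrates the calculation on hypothetical run times.

```python
# Relative standard deviation (RSD) of training time across repeated runs.
# The run times below are hypothetical; Table 3 uses the measured values.
from statistics import mean, stdev

def relative_std_dev(run_times_minutes):
    return stdev(run_times_minutes) / mean(run_times_minutes) * 100.0

runs = [5.04, 4.71, 5.32, 4.88, 5.10]   # example: five DLRMv2-style runs
print(f"RSD = {relative_std_dev(runs):.2f}%")
```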

Conversely, Figure 10 demonstrates that the D5-P5336 benefits significantly from increased disk count, particularly for DLRMv2, which reaches a maximum speed-up of 2.51x with 8 disks. GNN also exhibits steady improvements, although to a lesser extent. Other models display minor variations, with BERT experiencing slight performance degradation as the number of disks increases. These results suggest that the D5-P5336 is more reliant on disk count for performance gains, particularly for data-intensive workloads such as DLRMv2.

DLRMv2 is highly sensitive to SSD performance, prompting further tests specifically for this model. The D74H-7U hardware architecture supports NVIDIA GDS (GPUDirect Storage, as shown in Figure 11), a key feature for AI training acceleration. GDS enables direct data transfers between NVMe SSDs and GPUs, bypassing system memory and reducing CPU involvement. This optimization minimizes data transfer latency and maximizes throughput, particularly benefiting workloads that require high-speed data access. As a result, all tests on the D74H-7U were conducted with GDS enabled. Since the D74H-7U only supports NVMe SSDs, training tests for the D3-S4520 were conducted exclusively on the D54U-3U.

Figure 11. Illustration of NVIDIA GPUDirect Storage
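Applications reach GDS through NVIDIA's cuFile API; from Python, the kvikio library provides a thin wrapper over it. The hedged sketch below reads a file directly into GPU memory, assuming kvikio and CuPy are installed and the filesystem and driver stack support GDS (kvikio otherwise falls back to a compatibility, bounce-buffer path). It is not the data loader used by the MLPerf reference implementations.

```python
# Minimal GPUDirect Storage read sketch using kvikio (a cuFile wrapper).
import cupy
import kvikio

def read_shard_to_gpu(path, nbytes):
    buf = cupy.empty(nbytes, dtype=cupy.uint8)   # destination buffer in GPU memory
    with kvikio.CuFile(path, "r") as f:
        nread = f.read(buf)                      # NVMe -> GPU, bypassing host DRAM
    assert nread == nbytes
    return buf

# Hypothetical usage: load one 1 GiB shard of an embedding table for
# DLRMv2-style training directly into GPU memory.
# shard = read_shard_to_gpu("/mnt/mlperf_data/dlrm/shard_000.bin", 1 << 30)
```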

Figure 12 provides a comparative analysis of MLPerf training performance for DLRMv2 using the D7-PS1010 on the D74H-7U and D54U-3U. With a single disk, both systems perform similarly, but as the number of disks increases, the D74H-7U demonstrates noticeable improvements, achieving a maximum speed-up of 1.29x at 8 disks. In contrast, the D54U-3U remains close to 1.00x, suggesting that the D74H-7U benefits more from disk scaling with GDS enabled, while the D54U-3U faces architectural limitations in handling I/O scaling.

Figure 12. Comparison of MLPerf Training DLRMv2 with D7-PS1010 on D74H-7U and D54U-3U

Figure 13 shows that the D7-PS1010 consistently outperforms the D5-P5336, particularly with fewer devices. With a single disk, the training time on the D7-PS1010 is 5.04 minutes, whereas on the D5-P5336 it is significantly higher at 9.78 minutes. At 4 disks, performance gains begin to stabilize, with the D7-PS1010 at 4.14 minutes and the D5-P5336 at 4.15 minutes. As the number of devices increases to 8, the gap closes entirely, with the D7-PS1010 at 3.92 minutes and the D5-P5336 at 3.90 minutes. These results suggest that the higher PCIe Gen5 bandwidth of the D7-PS1010 provides a substantial advantage in configurations with fewer devices, but its impact diminishes as scaling reaches its efficiency limit.

Figure 13. DLRMv2 performance across disk configuration

Figure 14 further examines MLPerf training performance for DLRMv2 on the D54U-3U system, comparing D7-PS1010, D5-P5336, and D3-S4520 SSDs. The results indicate that while D7-PS1010 and D5-P5336 maintain stable training times across different disk configurations, the D3-S4520 exhibits significant speed-up with increased disk count. Notably, at 8 disks, D3-S4520 achieves a 6.78x speed-up compared to a single-disk configuration, reducing training time from 123.29 minutes to 18.19 minutes. In contrast, D7-PS1010 and D5-P5336 remain within a narrow performance range, with training times fluctuating around 15 minutes regardless of disk count. These findings highlight the critical role of storage type in AI training performance, particularly for workloads highly sensitive to disk read and write speeds.

Figure 14. Comparison of MLPerf Training DLRMv2 with D7-PS1010, D5-P5336, and D3-S4520 on D54U-3U

MLPerf Storage v1.0 Disk Configuration Performance Analysis

The MLPerf Storage benchmark simulates AI training on a GPU, primarily stressing the disks' read performance. The test results indicate that the performance of the SATA SSD (D3-S4520) is clearly insufficient, making NVMe the only viable option. For the single server used in this test, two D7-PS1010 drives already reach the maximum usable performance level, while the D5-P5336 requires four drives to reach its limit. For both the D7-PS1010 and D5-P5336, the read performance of a single disk is close to its theoretical specification limit in the ResNet50 and CosmoFlow workloads.

Figure 15, Figure 16, and Table 4 provide a detailed comparison of MLPerf Storage performance using the D7-PS1010 and D5-P5336 on the D54U-3U, analyzing multiple AI models, including ResNet-50, Unet3D, and CosmoFlow. The results demonstrate that disk performance scales differently across workloads, emphasizing the importance of understanding workload-specific storage needs.

In Figure 15, D7-PS1010 exhibits strong performance across all tested workloads. Unet3D benefits significantly from increased disk count, achieving a peak throughput of 23176.57 MiB/s with 8 disks, compared to 11869.57 MiB/s with a single disk. ResNet-50 follows a similar trend, with throughput increasing from 15550.54 MiB/s (1 disk) to 20069.97 MiB/s (2 disks) and then plateauing as more disks are added. CosmoFlow, however, demonstrates minimal gains from adding more disks, with throughput fluctuating between roughly 15,450 and 15,850 MiB/s, suggesting its storage access patterns do not fully exploit additional NVMe devices.

Figure 15. MLPerf Storage with D7-PS1010 on D54U-3U

Figure 16 presents the results for D5-P5336, revealing a different scaling pattern. While Unet3D maintains a strong scaling trend, reaching 23045.24 MiB/s with 8 disks, ResNet-50 sees more pronounced benefits from scaling than on the D7-PS1010, increasing from 8402.90 MiB/s (1 disk) to 19817.54 MiB/s (8 disks). CosmoFlow again exhibits limited scaling benefits, with throughput plateauing around 15,800 MiB/s at 4 or more disks. This suggests that for workloads like Unet3D and ResNet-50, the D5-P5336 provides competitive performance, albeit requiring more disks to reach peak efficiency.

Figure 16. MLPerf Storage with Solidigm D5-P5336 on D54U-3U

Table 4 presents results for three AI models, ResNet50, Unet3D, and CosmoFlow, evaluated using Solidigm SSDs (D7-PS1010, D5-P5336, and D3-S4520) across varying numbers of devices (1, 2, 4, and 8) on the D54U-3U platform. The table reports dataset sizes, Accelerator Utilization (AU), throughput (MiB/s), and the number of simulated accelerators, optimized through trial and error to achieve peak performance.

| AI Model | Metric | PS1010 x8 | PS1010 x4 | PS1010 x2 | PS1010 x1 | P5336 x8 | P5336 x4 | P5336 x2 | P5336 x1 | S4520 x8 | S4520 x4 | S4520 x2 | S4520 x1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | # Simulated H100 Accelerators | 111 | 111 | 111 | 86 | 112 | 112 | 86 | 47 | 28 | 14 | 6 | 2 |
| ResNet50 | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 |
| ResNet50 | AU_1 (%) | 90.30 | 91.87 | 92.77 | 92.83 | 90.93 | 93.29 | 90.42 | 91.58 | 91.16 | 91.34 | 98.78 | 95.37 |
| ResNet50 | AU_2 (%) | 90.26 | 91.72 | 92.65 | 92.69 | 90.07 | 93.16 | 90.51 | 91.70 | 91.22 | 91.29 | 98.79 | 95.25 |
| ResNet50 | AU_3 (%) | 90.80 | 91.76 | 92.92 | 92.72 | 90.89 | 93.01 | 90.33 | 91.45 | 91.36 | 91.44 | 98.75 | 95.31 |
| ResNet50 | AU_4 (%) | 90.36 | 91.17 | 92.32 | 92.59 | 90.88 | 92.48 | 90.39 | 91.70 | 91.34 | 91.31 | 98.79 | 95.23 |
| ResNet50 | AU_5 (%) | 90.59 | 91.71 | 92.52 | 92.35 | 90.50 | 93.26 | 90.52 | 91.47 | 91.27 | 91.43 | 98.80 | 95.16 |
| ResNet50 | Throughput (MiB/s) | 19598.80 | 19855.28 | 20069.97 | 15550.54 | 19817.54 | 20337.54 | 15181.02 | 8402.90 | 4989.15 | 2497.28 | 1157.25 | 371.96 |
| Unet3D | # Simulated H100 Accelerators | 8 | 8 | 7 | 4 | 8 | 8 | 4 | 2 | 1 | 1 | 1 | 1 |
| Unet3D | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 | 639 |
| Unet3D | AU_1 (%) | 96.29 | 95.67 | 90.98 | 98.66 | 97.34 | 96.55 | 98.66 | 98.72 | 98.80 | 67.83 | 29.85 | 11.58 |
| Unet3D | AU_2 (%) | 96.58 | 95.80 | 91.87 | 98.68 | 97.75 | 97.40 | 98.65 | 98.70 | 98.77 | 67.85 | 30.05 | 11.59 |
| Unet3D | AU_3 (%) | 96.78 | 94.87 | 92.06 | 98.68 | 97.17 | 98.26 | 98.66 | 98.73 | 98.79 | 67.86 | 30.11 | 11.60 |
| Unet3D | AU_4 (%) | 96.45 | 94.44 | 90.95 | 98.69 | 96.50 | 97.82 | 98.68 | 98.72 | 98.80 | 67.71 | 30.14 | 11.58 |
| Unet3D | AU_5 (%) | 96.57 | 95.69 | 91.01 | 98.68 | 96.11 | 97.78 | 98.66 | 98.73 | 98.79 | 67.74 | 30.12 | 11.59 |
| Unet3D | Throughput (MiB/s) | 23176.57 | 22877.37 | 19216.97 | 11869.57 | 23045.24 | 23143.96 | 11864.99 | 5938.05 | 2976.95 | INVALID | INVALID | INVALID |
| CosmoFlow | # Simulated H100 Accelerators | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 14 | 7 | 3 | 2 | 1 |
| CosmoFlow | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 |
| CosmoFlow | AU_1 (%) | 72.06 | 72.00 | 73.18 | 73.56 | 72.95 | 73.48 | 70.44 | 76.75 | 72.64 | 87.19 | 72.85 | 64.07 |
| CosmoFlow | AU_2 (%) | 71.70 | 71.98 | 73.26 | 73.67 | 72.90 | 73.86 | 70.57 | 76.80 | 72.85 | 87.43 | 72.85 | 64.74 |
| CosmoFlow | AU_3 (%) | 72.02 | 71.99 | 73.28 | 73.57 | 72.78 | 73.61 | 70.48 | 76.75 | 72.96 | 87.57 | 72.47 | 65.04 |
| CosmoFlow | AU_4 (%) | 71.77 | 72.08 | 73.27 | 73.70 | 72.68 | 73.97 | 70.62 | 76.81 | 73.16 | 87.74 | 72.84 | 65.05 |
| CosmoFlow | AU_5 (%) | 71.89 | 72.28 | 73.38 | 73.74 | 72.72 | 73.57 | 70.45 | 76.80 | 73.24 | 87.94 | 72.72 | 64.81 |
| CosmoFlow | Throughput (MiB/s) | 15461.26 | 15499.75 | 15757.37 | 15838.27 | 15657.91 | 15848.42 | 15165.49 | 8270.50 | 3933.09 | 2023.81 | 1121.04 | INVALID |

Table 4. MLPerf Storage results for the Solidigm D7-PS1010, D5-P5336, and D3-S4520 on D54U-3U. In the table, xN denotes the number of data drives, and AU_1 through AU_5 are the Accelerator Utilization percentages for the five required runs.

For ResNet50, AU exceeds 90% across all configurations, meeting its criterion, with throughput peaking at 20,069.97 MiB/s for the D7-PS1010 (2 devices) and 20,337.54 MiB/s for the D5-P5336 (4 devices), while the D3-S4520 reaches only 4,989.15 MiB/s even with 8 devices, indicating its inadequacy for high-throughput demands. Unet3D likewise clears its 90% AU criterion on the NVMe drives, with the D7-PS1010 and D5-P5336 delivering exceptional throughputs of 23,176.57 MiB/s and 23,045.24 MiB/s at 8 devices, respectively, underscoring NVMe's superiority over the D3-S4520, which produces invalid results at lower device counts. CosmoFlow, with an AU criterion of 70%, stays above that threshold in every valid configuration, but its throughput shows minimal scaling, hovering between roughly 15,450 and 15,840 MiB/s on the D7-PS1010 at every device count and plateauing in a similar range on the D5-P5336 once two or more devices are used, reflecting the workload's inherent characteristics. This suggests that CosmoFlow's data access patterns and computational demands are less sensitive to storage scaling, prioritizing other system factors such as compute and memory efficiency.

The simulated accelerators, optimized through iterative testing, reflect the best configuration for each setup, balancing AU and throughput. Solidigm D7-PS1010 generally outperforms D5-P5336 with fewer devices due to its PCIe Gen5 bandwidth, while D5-P5336 requires more scaling to match performance. D3-S4520 consistently underperforms, reinforcing NVMe’s necessity for AI workloads. These results highlight the importance of workload-specific storage planning, with NVMe SSDs critical for high-throughput models like UNet-3D, while CosmoFlow’s stability indicates less reliance on storage scaling.

Overall, the data confirms that NVMe SSDs are necessary for AI workloads, particularly for models like Unet3D that demand high throughput. The D7-PS1010 reaches its peak performance with fewer disks, while the D5-P5336 requires additional scaling to match those performance levels. The D3-S4520 is unsuitable for these tasks, emphasizing the need for careful storage selection in AI infrastructure planning.

Hardware Configuration Recommendations for AI Model Training

Based on the analysis, the following recommendations are provided for optimizing AI model training:

  1. Storage Selection: For workloads that are highly sensitive to disk read and write speeds, selecting high-performance NVMe SSDs, such as the Solidigm D7-PS1010 (PCIe Gen5), is crucial. For cost-sensitive deployments, storage scaling with multiple SSDs may help mitigate performance bottlenecks.
  2. Utilization of NVIDIA GDS: Enabling NVIDIA GPUDirect Storage (GDS) is recommended for accelerating AI training workloads, especially in environments where direct data transfers between NVMe SSDs and GPUs can reduce CPU overhead and memory bottlenecks.
  3. Balanced System Architecture: To achieve optimal performance, storage, CPU, and GPU configurations must be balanced. A system with high PCIe bandwidth and efficient data flow mechanisms will generally yield better results for AI training.

By following these recommendations, AI model training can be optimized for performance and efficiency, ensuring that hardware investments are effectively utilized to meet computational demands.

Conclusion

The benchmarking results from MLPerf Inference v4.1, Training v4.1, and Storage v1.0 highlight the critical role of high-performance storage in AI workloads, particularly in training scenarios. The evaluation of Solidigm SSDs—D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA)—demonstrates that while inference performance remains largely unaffected by storage configurations, AI training and storage-intensive workloads benefit significantly from high-speed NVMe solutions.

For inference workloads, the results confirm that disk count has no measurable impact, as models are preloaded into memory before execution. Consequently, optimizing GPU performance and memory bandwidth is more critical for improving inference efficiency than scaling storage solutions.

In contrast, training workloads, particularly for storage-intensive models such as DLRMv2, exhibit clear performance improvements when using high-speed NVMe SSDs. The Solidigm D7-PS1010 PCIe Gen5 SSD consistently delivers better training times compared to the D5-P5336 PCIe Gen4 SSD, particularly in low-disk configurations. However, as the number of disks increases, performance gains begin to plateau, indicating a threshold beyond which additional storage scaling offers diminishing returns.

The MLPerf Storage benchmark results further emphasize the necessity of NVMe SSDs for AI applications. The performance of the SATA SSD (Solidigm D3-S4520) is insufficient for modern AI workloads, making NVMe storage the preferred option. While D7-PS1010 achieves peak efficiency with fewer disks, D5-P5336 requires additional scaling to match performance levels, underscoring the importance of workload-specific storage planning.

Overall, these findings highlight that AI infrastructure optimization requires a balanced approach, where training workloads demand high-performance NVMe SSDs, while inference workloads benefit more from GPU and memory enhancements. Organizations aiming to scale AI deployments should prioritize storage solutions based on workload requirements, ensuring an optimal balance between compute power, memory bandwidth, and storage performance to maximize efficiency and scalability.


About the Authors

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” May 24, 2019, arXiv: arXiv:1810.04805. doi: 10.48550/arXiv.1810.04805.

[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Dec. 10, 2015, arXiv: arXiv:1512.03385. doi: 10.48550/arXiv.1512.03385.

[3] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” Feb. 07, 2018, arXiv: arXiv:1708.02002. doi: 10.48550/arXiv.1708.02002.

[4] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” Jun. 21, 2016, arXiv: arXiv:1606.06650. doi: 10.48550/arXiv.1606.06650.

[5] M. Naumov et al., “Deep Learning Recommendation Model for Personalization and Recommendation Systems,” May 31, 2019, arXiv: arXiv:1906.00091. doi: 10.48550/arXiv.1906.00091.

[6] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 billion parameter autoregressive language model.” May 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax

[7] H. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” Jul. 19, 2023, arXiv: arXiv:2307.09288. doi: 10.48550/arXiv.2307.09288.

[8] A. Q. Jiang et al., “Mixtral of Experts,” Jan. 08, 2024, arXiv: arXiv:2401.04088. doi: 10.48550/arXiv.2401.04088.

[9] D. Podell et al., “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis,” Jul. 04, 2023, arXiv: arXiv:2307.01952. doi: 10.48550/arXiv.2307.01952.

[10] M. Chen, Y. Zhang, X. Kou, Y. Li, and Y. Zhang, “r-GAT: Relational Graph Attention Network for Multi-Relational Graphs,” Sep. 13, 2021, arXiv: arXiv:2109.05922. doi: 10.48550/arXiv.2109.05922.

[11] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685.

[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” Apr. 13, 2022, arXiv: arXiv:2112.10752. doi: 10.48550/arXiv.2112.10752.

[13] W. Liu et al., “SSD: Single Shot MultiBox Detector,” vol. 9905, 2016, pp. 21–37. doi: 10.1007/978-3-319-46448-0_2.

[14] H. Devarajan, H. Zheng, A. Kougkas, X.-H. Sun, and V. Vishwanath, “DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications,” in 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 2021, pp. 81–91. doi: 10.1109/CCGrid51090.2021.00018.

[15] J. Zhou et al., “Graph Neural Networks: A Review of Methods and Applications,” Oct. 06, 2021, arXiv: arXiv:1812.08434. doi: 10.48550/arXiv.1812.08434.

Disclaimers

©2025, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.

Other names and brands may be claimed as the property of others. 

Solidigm may make changes to specifications and product descriptions at any time, without notice. 

Tests document the performance of components on a particular test, in specific systems. 

Differences in hardware, software, or configuration will affect actual performance.

Consult other sources of information to evaluate performance as you consider your purchase. 

These results are preliminary and provided for information purposes only. These values and claims are neither final nor official. 

Drives are considered engineering samples. Refer to roadmap for production guidance.