The evolution of artificial intelligence (AI) workloads has amplified the demand for efficient storage and compute solutions to optimize performance across training and inference tasks. This study leverages MLPerf benchmarks—Inference v4.1, Training v4.1, and Storage v1.0—to evaluate the impact of Solidigm SSDs, specifically the D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA), on AI efficiency. Results reveal that inference performance remains largely unaffected by disk configurations, as it depends primarily on GPU capabilities and memory bandwidth, with no significant gains from additional SSDs. In contrast, training workloads, particularly data-intensive models like DLRMv2, exhibit substantial performance improvements with high-speed NVMe SSDs, where the D7-PS1010 outperforms the D5-P5336 in configurations with fewer disks, although gains plateau with scaling. The MLPerf Storage benchmark further confirms NVMe’s superiority over SATA, with D7-PS1010 achieving peak throughput with fewer disks compared to D5-P5336, while D3-S4520 proves inadequate for modern AI demands. These findings underscore the need for tailored storage strategies, with high-performance NVMe for training and a focus on compute optimization for inference, highlighting the critical role of infrastructure balance in maximizing AI system efficiency.
The increasing complexity of artificial intelligence (AI) workloads has placed unprecedented demands on system performance, necessitating a nuanced understanding of how storage and compute components influence efficiency. The MLPerf benchmarking suites—Inference, Training, and Storage—provide a standardized framework to assess AI system performance across diverse hardware configurations, offering critical insights into optimizing these workloads.
MLPerf Inference evaluates real-time prediction tasks, where efficiency hinges on model execution speed, typically within memory, rendering disk performance a secondary factor. Conversely, MLPerf Training examines the process of building AI models from scratch, a phase heavily reliant on storage throughput due to extensive data access requirements, especially for tasks like recommendation systems and image processing. Complementing these, the MLPerf Storage benchmark isolates storage performance under AI-specific data pipelines, addressing the growing need for scalable, high-throughput solutions in data-intensive applications.
This study investigates the interplay between storage configurations and AI performance using Solidigm NVMe SSDs—D7-PS1010, D5-P5336, and D3-S4520—across two server platforms, QuantaGrid D74H-7U and D54U-3U. Findings indicate that inference workloads are compute- and memory-bound, showing negligible benefits from enhanced storage setups, while training and storage benchmarks reveal significant advantages with NVMe SSDs, particularly for models like DLRMv2 that demand rapid data retrieval. By analyzing these outcomes, this study underscores the pivotal role of high-performance storage in training scenarios and the primacy of GPU and memory optimization in inference, providing actionable guidance for designing efficient AI infrastructures. These insights aim to inform stakeholders in academia and industry on aligning hardware choices with workload-specific needs to achieve scalability and sustained performance.
To evaluate the impact of storage configurations on AI workloads, MLPerf inference, training, and storage benchmarks were conducted on two server platforms.
| System | QuantaGrid D74H-7U | QuantaGrid D54U-3U |
|---|---|---|
| CPU | Intel Xeon Platinum 8480+ (56 cores) x 2 | Intel Xeon Platinum 8470 (52 cores) x 2 |
| RAM | 2 TB (DDR5-4800 64 GB x 32) | 2 TB (DDR5-4800 64 GB x 32) |
| OS Disk | Samsung PM9A3 3.84 TB x 1 | Samsung PM9A3 1.92 TB x 1 |
| Data Disk | Solidigm D7-PS1010 U.2 7.68 TB x 8; Solidigm D5-P5336 U.2 15.36 TB x 8 | Solidigm D7-PS1010 U.2 7.68 TB x 8; Solidigm D5-P5336 U.2 15.36 TB x 8; Solidigm D3-S4520 SATA 7.68 TB x 8 |
| Accelerator | H100 SXM5 80GB x 8 | H100 PCIe 80GB x 4 |
| BIOS Configuration | Profile: Performance; Enable LP [Global]: ALL LPs; SNC: Disable | Profile: Performance; Enable LP [Global]: ALL LPs; SNC: Disable |
| OS | Rocky Linux release 9.2 (Blue Onyx) | Rocky Linux release 9.2 (Blue Onyx) |
| Kernel | 5.14.0-362.18.1.el9_3.x86_64 | 5.14.0-284.11.1.el9_2.x86_64 |
| Framework | GPU Driver 550.127.08, CUDA 12.4 + GDS 12.4 | GPU Driver 550.90.07, CUDA 12.4 |

Table 1: Configuration of the QuantaGrid D74H-7U and D54U-3U with the different Solidigm SSD solutions.
| | Solidigm™ D7-PS1010 (Form Factor: U.2) | Solidigm™ D5-P5336 (Form Factor: U.2) | Solidigm™ D3-S4520 (Form Factor: U.2) |
|---|---|---|---|
| Capacity | 7.68 TB | 15.36 TB | 7.68 TB |
| Lithography Type | 176L TLC 3D NAND | 192L QLC 3D NAND | 144L TLC 3D NAND |
| Interface | PCIe 5.0 x4, NVMe | PCIe 4.0 x4, NVMe | SATA 3.0 6Gb/s |
| Sequential Read (up to) | 14,500 MB/s | 7,000 MB/s | 550 MB/s |
| Sequential Write (up to) | 9,300 MB/s | 3,000 MB/s | 510 MB/s |
| Random Read (up to) | 2,800,000 IOPS (4K) | 1,005,000 IOPS (4K) | 86,000 IOPS (4K) |
| Random Write (up to) | 400,000 IOPS (4K) | 24,000 IOPS (4K) | 30,000 IOPS (4K) |

Table 2: Specifications of the Solidigm D7-PS1010, D5-P5336, and D3-S4520.
MLPerf inference testing was performed on both platforms, with each system evaluated in "Server" and "Offline" modes to simulate real-world AI inference environments. The benchmarks assessed performance across different storage configurations using 1, 2, 4, and 8 drives to analyze scalability and throughput efficiency. Given that inference workloads primarily rely on memory and GPU performance, the objective was to determine whether increased disk count had any measurable impact on performance.
For MLPerf training and storage benchmarks, both QuantaGrid D74H-7U and D54U-3U were utilized for training workloads, while D54U-3U was also used for storage performance evaluation. Training tests examined the relationship between storage configurations and AI model performance. Storage benchmarks analyzed disk throughput and efficiency under AI-specific workloads to assess the benefits of NVMe SSDs over SATA alternatives.
In configurations utilizing 2, 4, and 8 drives, a software RAID0 setup was implemented to optimize read and write speeds, ensuring efficient data distribution across SSDs. To fully leverage SSD performance, all NVMe SSDs were directly connected via the CPU’s PCIe lanes or through a PCIe switch. Hardware RAID was avoided to prevent potential bandwidth limitations imposed by RAID controllers, ensuring that AI workloads could maximize storage throughput without PCIe lane constraints.
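For reference, the sketch below shows how such a software RAID0 data volume might be assembled and mounted with mdadm. It is a minimal illustration rather than the exact procedure used in this study: the device names, array name, filesystem choice (XFS), and mount point are hypothetical placeholders, and the commands require root privileges and the mdadm/xfsprogs packages.

```python
import subprocess

def run(cmd):
    """Run a system command and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical NVMe namespaces; substitute the data SSDs present on the system.
devices = ["/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1", "/dev/nvme4n1"]
array = "/dev/md0"
mount_point = "/mnt/mlperf-data"

# Create a software RAID0 array striped across all data SSDs.
run(["mdadm", "--create", array, "--level=0",
     f"--raid-devices={len(devices)}", *devices])

# Format the array and mount it for the benchmark datasets (XFS chosen here as an example).
run(["mkfs.xfs", "-f", array])
run(["mkdir", "-p", mount_point])
run(["mount", array, mount_point])
```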
This section outlines the MLPerf benchmarking suites—Inference v4.1, Training v4.1, and Storage v1.0—developed by the MLCommons Association to evaluate AI system performance across inference, training, and storage workloads. These suites provide standardized, reproducible methodologies to assess hardware and software efficiency, ensuring fairness, comparability, and flexibility through distinct divisions and rules.
MLPerf Inference v4.1 is designed to measure the performance of AI systems during real-time inference tasks, focusing on execution speed, latency, and accuracy. It evaluates a diverse set of workloads, including BERT [1], ResNet-50 [2], RetinaNet [3], 3D-Unet [4], DLRMv2 [5], GPT-J [6], Llama2-70B [7], Mixtral-8x7B [8], and Stable Diffusion XL (SDXL) [9], using standardized system configurations and frameworks. The suite quantifies key metrics such as latency, throughput, and efficiency while ensuring that model accuracy meets predefined standards, supporting platforms ranging from low-power edge devices to high-performance data center servers. It promotes openness and comparability across vision, language, commerce, generative, and graph domains, addressing diverse deployment contexts.
Key Definitions
In MLPerf Inference, critical terms include:
Testing Scenarios
MLPerf Inference incorporates four distinct testing scenarios to mirror real-world inference workloads, as detailed in the following table:
| Scenario | Purpose | Use Case | Metric |
|---|---|---|---|
| Single Stream | Assesses latency for a single query stream | Real-time applications like voice recognition or live video analysis | Time to process each query |
| Multi-Stream | Tests performance across multiple simultaneous streams | Multi-user systems, such as video streaming or chatbots | Latency and throughput with concurrent queries |
| Server | Evaluates handling of dynamic, online query loads | Cloud inference services with fluctuating demand | Queries per second (QPS) under latency constraints |
| Offline | Measures throughput for large batch processing | Bulk tasks like dataset analysis or media indexing | Total queries processed in a set timeframe |
Submission Divisions
MLPerf Inference features two divisions: Closed and Open. The Closed Division mandates equivalence to the reference or alternative implementation, allowing calibration for quantization but prohibiting retraining. The Open Division permits arbitrary pre-/post-processing and models, including retraining, with reported accuracy and latency constraints, fostering innovation but sacrificing comparability.
Methodology and Workflow
The MLPerf Inference process relies on the Load Generator (LoadGen), a C++ tool with Python bindings, to simulate queries, track latency, validate accuracy, and compute metrics. LoadGen runs on the host processor, generating queries from logical sources and storing query traces in DRAM, in keeping with the benchmark's bandwidth requirements. Figure 1 illustrates the simplified workflow of MLPerf Inference from configuration to validation, including a failure path in which validation does not pass and the System Under Test (SUT) must be reconfigured.
These scenarios ensure comprehensive coverage of latency-critical and throughput-focused applications, with early stopping criteria allowing shorter runs while maintaining statistical validity.
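To make the LoadGen-driven workflow described above concrete, the sketch below outlines a minimal Offline-scenario harness using the mlperf_loadgen Python bindings. It is an illustrative skeleton, not a compliant submission harness: the predict() stub, sample counts, and buffer handling are placeholders, and the exact callback signatures should be checked against the LoadGen release pinned for v4.1.

```python
import array
import mlperf_loadgen as lg

TOTAL_SAMPLES = 1024          # size of a hypothetical dataset
PERF_SAMPLES = 256            # samples LoadGen may keep resident in DRAM

def predict(sample_index):
    """Placeholder for real model execution on the accelerator."""
    return b"\x00"            # dummy inference result

def issue_queries(query_samples):
    """Called by LoadGen with a batch of QuerySample objects."""
    responses, buffers = [], []
    for qs in query_samples:
        buf = array.array("B", predict(qs.index))
        buffers.append(buf)   # keep buffers alive until completion is reported
        addr, _ = buf.buffer_info()
        responses.append(lg.QuerySampleResponse(qs.id, addr, len(buf)))
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass                      # nothing buffered in this toy SUT

def load_samples(indices):
    pass                      # a real harness would stage these samples in memory

def unload_samples(indices):
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(TOTAL_SAMPLES, PERF_SAMPLES, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)   # LoadGen drives the scenario and writes summary logs
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```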
Rules and Guidelines
Rules ensure fairness, requiring consistent systems and frameworks, open-sourcing code, restricting non-determinism to fixed seeds, and prohibiting benchmark detection or input-based optimization. Replicability is mandatory, with audits validating compliance, particularly for Closed submissions.
Use Cases and Impact
MLPerf Inference supports edge computing, cloud infrastructure, and specialized domains, optimizing real-time inference and scalability, driving efficient AI solution development.
MLPerf Training v4.1 establishes standardized benchmarks to measure training performance, defined as execution speed, across diverse ML tasks. It evaluates workloads like BERT, DLRMv2, GNN (R-GAT) [10], Low-Rank Adaptation (LoRA) [11], Stable Diffusion (SD) [12], and Single-Shot Detector (SSD) [13], ensuring fairness through defined rules. Performance and quality are key metrics, with results eligible for the MLPerf trademark upon compliance. The suite encompasses systems, frameworks, benchmarks, and runs normalized against reference results.
Key Definitions
Key terms include:
Benchmarks and Divisions
The suite spans vision, language, commerce, and graph domains, with Closed and Open divisions. Closed mandates reference preprocessing, models, and targets, ensuring comparability, while Open allows flexibility in data and methods, requiring iterative improvement and benchmark dataset alignment.
Methodology and Workflow
Training adheres to reference models, weights, optimizers, and hyperparameters, with random number generation (Closed: stock, clock-seeded via mllog) and numerical formats (Closed: pre-approved, e.g., fp32, fp16) restricted. Data handling ensures reference consistency, with quality evaluated at specified frequencies, results derived from multiple runs and normalized against references. Figure 2 illustrates the simplified workflow of MLPerf Training from system definition to the convergence check stage, incorporating a failure scenario where convergence does not meet Reference Convergence Points (RCPs), requiring adjustments to hyperparameters or re-running the process.
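As an illustration of the clock-seeded RNG and logging convention mentioned above, the sketch below shows how a submission might derive a seed from the wall clock and record it with the mlperf_logging mllog package. This is a minimal sketch under assumptions: the log file name is hypothetical, and the constant names and metadata conventions should be verified against the mlperf_logging version used for v4.1.

```python
import time
import random

from mlperf_logging import mllog

mllog.config(filename="dlrm_training.log")   # hypothetical log file name
mllogger = mllog.get_mllogger()

# Closed-division convention: the seed comes from the clock, not from tuning,
# and its value is recorded so the run can be audited and replicated.
seed = int(time.time())
random.seed(seed)

mllogger.event(key=mllog.constants.SEED, value=seed)
mllogger.start(key=mllog.constants.RUN_START)
# ... training loop would run here ...
mllogger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```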
Rules and Guidelines
Fairness is paramount, prohibiting benchmark detection, pre-training (except metadata), and requiring replicability. Reference Convergence Points (RCPs) ensure submission convergence aligns with reference, with audits and hyperparameter borrowing allowed to optimize performance.
Use Cases and Impact
MLPerf Training supports AI model development in vision, language, and commerce, optimizing training for data centers, enhancing scalability, and driving hardware/software innovation.
MLPerf Storage v1.0 evaluates storage system performance for ML workloads, emulating accelerator demands via sleep intervals to isolate data-ingestion pipelines, enabling scalable testing without compute clusters. It focuses on storage scalability and performance, supporting workloads like 3D U-Net, ResNet-50, and CosmoFlow.
Key Definitions
Key terms include:
Benchmarks and Divisions
The suite simulates I/O patterns from MLPerf Training/HPC, measuring samples/second with minimum AU thresholds (e.g., 90% for ResNet-50). Closed standardizes parameters for comparability, restricting modifications, while Open allows customizations (e.g., DLIO changes) for innovation, requiring documentation.
Methodology and Workflow
MLPerf Storage uses DLIO to generate synthetic datasets (≥5x host DRAM, avoiding caching), calculating minimum sizes based on accelerators, memory, and steps. Single-host or distributed setups scale load, synchronizing via barriers, measuring performance as samples/second. Figure 3 illustrates the simplified workflow of MLPerf Storage from configuration to the validation of Accelerator Utilization (AU) threshold stage, incorporating a failure scenario where the AU threshold is not met, requiring adjustments to the storage system or configuration.
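As a simple illustration of the dataset-sizing rule above, the sketch below computes only the DRAM-based floor (at least five times host memory) and the corresponding sample count; the full benchmark also factors in the number of accelerators and steps. The memory and sample-size values shown are hypothetical placeholders, not the values used in this study.

```python
import math

def min_dataset_gib(host_dram_gib: float, sample_size_mib: float,
                    dram_multiplier: float = 5.0) -> tuple[float, int]:
    """Return the minimum synthetic dataset size (GiB) and sample count so the
    dataset is at least `dram_multiplier` times host DRAM and cannot be cached."""
    min_size_gib = dram_multiplier * host_dram_gib
    samples = math.ceil(min_size_gib * 1024 / sample_size_mib)
    return min_size_gib, samples

# Hypothetical host: 512 GiB of DRAM, samples of ~0.1 MiB each.
size_gib, n_samples = min_dataset_gib(host_dram_gib=512, sample_size_mib=0.1)
print(f"minimum dataset: {size_gib:.0f} GiB (~{n_samples:,} samples)")
```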
Rules and Guidelines
Rules ensure fairness, requiring available systems (commercially available within 6 months), fixed seeds, stable storage, no data preloading, cleared caches, and replicability within 5% across five runs. Audits verify compliance, with Closed submissions using the provided scripts and Open allowing DLIO modifications.
Use Cases and Impact
MLPerf Storage optimizes storage for ML training, supporting large-scale data pipelines in vision, scientific computing, and more, guiding infrastructure planning for scalability and efficiency.
The performance evaluation of MLPerf Inference v4.1 using Solidigm D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA SSDs) across different RAID0 disk configurations indicates that increasing the number of disks has negligible impact on inference performance. Speed-up values remain nearly constant across all tested models, including ResNet50, RetinaNet, BERT, DLRMv2, 3D-Unet, SDXL, GPT-J, Llama2-70b, and Mixtral.
From the provided Figure 4 to Figure 8, we observe that for D7-PS1010, D5-P5336, and D3-S4520, inference speed-up remains unchanged across different disk configurations. This suggests that MLPerf inference workloads are largely compute- and memory-bound rather than I/O-bound. Since inference primarily involves model execution in memory with minimal disk access, adding more storage devices does not provide measurable performance gains.
Additionally, across the D74H-7U and D54U-3U platforms, the trend remains consistent, with no significant variation in speed-up observed between different SSD models or disk configurations. This further reinforces that MLPerf inference does not rely on disk I/O for performance improvements, and disk choice plays a minor role in overall system efficiency.
A notable case is Mixtral, a model newly added in MLPerf Inference v4.1 that has been optimized and quantized by NVIDIA for high-performance GPUs such as the H100 and H200 SXM5. The Mixtral implementation does not fully support the H100 PCIe 80GB in the D54U-3U, so it was omitted from tests on that platform.
These findings emphasize that for AI inference tasks, investing in additional high-speed SSDs may not yield significant benefits, and efforts should instead focus on optimizing compute acceleration and memory efficiency.
Figure 9 and Figure 10 provide a comparative analysis of MLPerf training speed-up across varying storage configurations, using the D7-PS1010 and D5-P5336 on the D74H-7U. These figures highlight the scalability characteristics of multiple machine learning models, including BERT, DLRMv2, GNN, LoRA, Stable Diffusion (SD), and Single-Shot Detector (SSD), under different disk configurations (1, 2, 4, and 8 disks).
In Figure 9, DLRMv2 and GNN show the most noticeable improvements as the number of disks increases. DLRMv2 achieves a peak speed-up of 1.29x with 8 disks, while GNN reaches 1.10x. Other models display only marginal changes, suggesting limited disk I/O dependency.
Table 3 presents the relative standard deviation (RSD) for each model under the different storage configurations. The RSD values show noticeable run-to-run variability for every model, suggesting that training performance is influenced by factors beyond disk I/O. Random seed selection compounds this variability, since it affects training convergence and computational effort across repeated runs.
| AI Model | D7-PS1010 (8) | D7-PS1010 (4) | D7-PS1010 (2) | D7-PS1010 (1) | D5-P5336 (8) | D5-P5336 (4) | D5-P5336 (2) | D5-P5336 (1) |
|---|---|---|---|---|---|---|---|---|
| BERT | 7.65% | 9.46% | 8.03% | 8.95% | 5.70% | 9.91% | 72.50% | 5.90% |
| DLRMv2 | 5.13% | 7.32% | 4.32% | 6.91% | 5.46% | 5.38% | 3.02% | 3.71% |
| GNN | 4.50% | 3.98% | 5.26% | 3.69% | 4.20% | 6.77% | 34.50% | 4.14% |
| LoRA | 6.17% | 4.27% | 8.17% | 8.58% | 6.55% | 6.19% | 5.48% | 6.33% |
| SD | 13.86% | 11.21% | 11.75% | 15.39% | 11.18% | 11.93% | 13.54% | 11.40% |
| SSD | 0.07% | 10.65% | 0.12% | 0.22% | 0.17% | 0.16% | 0.08% | 0.04% |

Table 3. Relative standard deviation for each AI model workload in MLPerf Training (column headers give the number of data disks).
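For reference, the RSD values reported in Table 3 can be computed from the per-run training times as the standard deviation divided by the mean; the minimal sketch below uses the sample standard deviation and made-up run times, not the measurements from this study.

```python
from statistics import mean, stdev

def relative_std_dev(run_times_minutes: list[float]) -> float:
    """Relative standard deviation (%) across repeated training runs."""
    return 100.0 * stdev(run_times_minutes) / mean(run_times_minutes)

# Hypothetical DLRMv2 run times (minutes) for one storage configuration.
runs = [5.04, 4.78, 5.31, 4.95, 5.20]
print(f"RSD = {relative_std_dev(runs):.2f}%")
```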
Conversely, Figure 10 demonstrates that the D5-P5336 configuration benefits significantly from increased disk count, particularly for DLRMv2, which reaches a maximum speed-up of 2.51x with 8 disks. GNN also exhibits steady improvements, although to a lesser extent. Other models display minor variations, with BERT experiencing slight performance degradation as the number of disks increases. These results suggest that the D5-P5336 relies more on disk count for performance gains, particularly for data-intensive workloads such as DLRMv2.
DLRMv2 is highly sensitive to SSD performance, prompting further tests specifically for this model. The D74H-7U hardware architecture supports NVIDIA GDS (GPUDirect Storage, as shown in Figure 11), a key feature for AI training acceleration. GDS enables direct data transfers between NVMe SSDs and GPUs, bypassing system memory and reducing CPU involvement. This optimization minimizes data transfer latency and maximizes throughput, particularly benefiting workloads that require high-speed data access. As a result, all tests on the D74H-7U were conducted with GDS enabled. Since the D74H-7U only supports NVMe SSDs, training tests for the D3-S4520 were conducted exclusively on the D54U-3U.
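GDS is exposed through NVIDIA's cuFile API. As a minimal illustration only (not the data path used by the MLPerf reference implementations), the sketch below uses the RAPIDS kvikio Python wrapper to read a file directly into GPU memory; it assumes kvikio and CuPy are installed, uses a hypothetical file path, and falls back to a host-mediated read when GDS is unavailable.

```python
import cupy
import kvikio

# Hypothetical file staged on the RAID0 NVMe volume; it must be at least 64 MiB.
path = "/mnt/mlperf-data/dlrm_shard_000.bin"

# Allocate the destination buffer directly in GPU memory.
gpu_buffer = cupy.empty(64 * 1024 * 1024, dtype=cupy.uint8)   # 64 MiB

# With GDS available, kvikio routes this read NVMe -> GPU via cuFile,
# bypassing the host bounce buffer and reducing CPU involvement.
with kvikio.CuFile(path, "r") as f:
    nbytes = f.read(gpu_buffer)

print(f"read {nbytes} bytes directly into GPU memory")
```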
Figure 12 provides a comparative analysis of MLPerf training performance for DLRMv2 using the D7-PS1010 on the D74H-7U and D54U-3U. With a single disk, both systems perform similarly, but as the number of disks increases, the D7-PS1010 on the D74H-7U demonstrates noticeable improvements, achieving a maximum speed-up of 1.29x at 8 disks. In contrast, the D54U-3U remains close to 1.00x, suggesting that the D74H-7U benefits more from disk scaling with GDS enabled, while the D54U-3U faces architectural limitations in I/O scaling.
Figure 13 shows that the D7-PS1010 consistently outperforms the D5-P5336, particularly with fewer devices. With a single disk, the training time on the D7-PS1010 is 5.04 minutes, whereas on the D5-P5336 it is significantly higher at 9.78 minutes. At 4 disks, performance begins to stabilize, with the D7-PS1010 at 4.14 minutes and the D5-P5336 at 4.15 minutes. As the number of devices increases to 8, the gap fully converges, with the D7-PS1010 at 3.92 minutes and the D5-P5336 at 3.90 minutes. These results suggest that the higher bandwidth of PCIe Gen5 in the D7-PS1010 provides a substantial advantage in setups with fewer devices, but its impact diminishes as scaling reaches its efficiency limit.
Figure 14 further examines MLPerf training performance for DLRMv2 on the D54U-3U system, comparing D7-PS1010, D5-P5336, and D3-S4520 SSDs. The results indicate that while D7-PS1010 and D5-P5336 maintain stable training times across different disk configurations, the D3-S4520 exhibits significant speed-up with increased disk count. Notably, at 8 disks, D3-S4520 achieves a 6.78x speed-up compared to a single-disk configuration, reducing training time from 123.29 minutes to 18.19 minutes. In contrast, D7-PS1010 and D5-P5336 remain within a narrow performance range, with training times fluctuating around 15 minutes regardless of disk count. These findings highlight the critical role of storage type in AI training performance, particularly for workloads highly sensitive to disk read and write speeds.
The MLPerf Storage benchmark simulates AI training on GPUs and primarily stresses the disks' read performance. The test results indicate that the performance of the SATA SSD (D3-S4520) is clearly insufficient, making NVMe the only viable option. For the single server used in this test, two D7-PS1010 drives already reach the maximum usable performance, while the D5-P5336 requires four drives to reach its limit. For both the D7-PS1010 and the D5-P5336, the read performance of a single disk comes close to its theoretical specification limit in the ResNet-50 and CosmoFlow workloads.
Figure 15, Figure 16, and Table 4 provide a detailed comparison of MLPerf Storage performance using the D7-PS1010 and D5-P5336 on the D54U-3U, analyzing multiple AI models, including ResNet-50, Unet3D, and CosmoFlow. The results demonstrate that disk performance scales differently across workloads, emphasizing the importance of understanding workload-specific storage needs.
In Figure 15, D7-PS1010 exhibits strong performance across all tested workloads. Unet3D benefits significantly from increased disk count, achieving a peak throughput of 23176.57 MiB/s with 8 disks, compared to 11869.57 MiB/s with a single disk. ResNet-50 follows a similar trend, with throughput increasing from 15550.54 MiB/s (1 disk) to 20069.97 MiB/s (2 disks) but stabilizing beyond 4 disks. CosmoFlow, however, demonstrates minimal gains from adding more disks, with throughput fluctuating around 15838.27 MiB/s, suggesting its storage access patterns do not fully exploit additional NVMe devices.
Figure 16 presents the results for the D5-P5336, revealing a different scaling pattern. While Unet3D maintains a strong scaling trend, reaching 23045.24 MiB/s with 8 disks, ResNet-50 sees more pronounced benefits than on the D7-PS1010, increasing from 8402.90 MiB/s (1 disk) to 19817.54 MiB/s (8 disks). CosmoFlow again exhibits limited scaling benefits, with throughput peaking at 15657.91 MiB/s with 8 disks. This suggests that for workloads like Unet3D and ResNet-50, the D5-P5336 provides competitive performance, albeit requiring more disks to reach peak efficiency.
Table 4 presents results for three AI models (ResNet-50, Unet3D, and CosmoFlow) evaluated with the Solidigm D7-PS1010, D5-P5336, and D3-S4520 across varying numbers of devices (1, 2, 4, and 8) on the D54U-3U platform. The table reports dataset sizes, Accelerator Utilization (AU), throughput (MiB/s), and the number of simulated accelerators, which was tuned by trial and error to achieve peak performance.
| AI Model | Metric | D7-PS1010 (8) | D7-PS1010 (4) | D7-PS1010 (2) | D7-PS1010 (1) | D5-P5336 (8) | D5-P5336 (4) | D5-P5336 (2) | D5-P5336 (1) | D3-S4520 (8) | D3-S4520 (4) | D3-S4520 (2) | D3-S4520 (1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | # Simulated H100 Accelerators | 111 | 111 | 111 | 86 | 112 | 112 | 86 | 47 | 28 | 14 | 6 | 2 |
| ResNet50 | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 |
| ResNet50 | AU_1 | 90.30 | 91.87 | 92.77 | 92.83 | 90.93 | 93.29 | 90.42 | 91.58 | 91.16 | 91.34 | 98.78 | 95.37 |
| ResNet50 | AU_2 | 90.26 | 91.72 | 92.65 | 92.69 | 90.07 | 93.16 | 90.51 | 91.70 | 91.22 | 91.29 | 98.79 | 95.25 |
| ResNet50 | AU_3 | 90.80 | 91.76 | 92.92 | 92.72 | 90.89 | 93.01 | 90.33 | 91.45 | 91.36 | 91.44 | 98.75 | 95.31 |
| ResNet50 | AU_4 | 90.36 | 91.17 | 92.32 | 92.59 | 90.88 | 92.48 | 90.39 | 91.70 | 91.34 | 91.31 | 98.79 | 95.23 |
| ResNet50 | AU_5 | 90.59 | 91.71 | 92.52 | 92.35 | 90.50 | 93.26 | 90.52 | 91.47 | 91.27 | 91.43 | 98.80 | 95.16 |
| ResNet50 | Throughput (MiB/s) | 19598.80 | 19855.28 | 20069.97 | 15550.54 | 19817.54 | 20337.54 | 15181.02 | 8402.90 | 4989.15 | 2497.28 | 1157.25 | 371.96 |
| Unet3D | # Simulated H100 Accelerators | 8 | 8 | 7 | 4 | 8 | 8 | 4 | 2 | 1 | 1 | 1 | 1 |
| Unet3D | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 | 639 |
| Unet3D | AU_1 | 96.29 | 95.67 | 90.98 | 98.66 | 97.34 | 96.55 | 98.66 | 98.72 | 98.80 | 67.83 | 29.85 | 11.58 |
| Unet3D | AU_2 | 96.58 | 95.80 | 91.87 | 98.68 | 97.75 | 97.40 | 98.65 | 98.70 | 98.77 | 67.85 | 30.05 | 11.59 |
| Unet3D | AU_3 | 96.78 | 94.87 | 92.06 | 98.68 | 97.17 | 98.26 | 98.66 | 98.73 | 98.79 | 67.86 | 30.11 | 11.60 |
| Unet3D | AU_4 | 96.45 | 94.44 | 90.95 | 98.69 | 96.50 | 97.82 | 98.68 | 98.72 | 98.80 | 67.71 | 30.14 | 11.58 |
| Unet3D | AU_5 | 96.57 | 95.69 | 91.01 | 98.68 | 96.11 | 97.78 | 98.66 | 98.73 | 98.79 | 67.74 | 30.12 | 11.59 |
| Unet3D | Throughput (MiB/s) | 23176.57 | 22877.37 | 19216.97 | 11869.57 | 23045.24 | 23143.96 | 11864.99 | 5938.05 | 2976.95 | INVALID | INVALID | INVALID |
| CosmoFlow | # Simulated H100 Accelerators | 28 | 28 | 28 | 28 | 28 | 28 | 28 | 14 | 7 | 3 | 2 | 1 |
| CosmoFlow | Dataset Size (GiB) | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 5030 | 639 |
| CosmoFlow | AU_1 | 72.06 | 72.00 | 73.18 | 73.56 | 72.95 | 73.48 | 70.44 | 76.75 | 72.64 | 87.19 | 72.85 | 64.07 |
| CosmoFlow | AU_2 | 71.70 | 71.98 | 73.26 | 73.67 | 72.90 | 73.86 | 70.57 | 76.80 | 72.85 | 87.43 | 72.85 | 64.74 |
| CosmoFlow | AU_3 | 72.02 | 71.99 | 73.28 | 73.57 | 72.78 | 73.61 | 70.48 | 76.75 | 72.96 | 87.57 | 72.47 | 65.04 |
| CosmoFlow | AU_4 | 71.77 | 72.08 | 73.27 | 73.70 | 72.68 | 73.97 | 70.62 | 76.81 | 73.16 | 87.74 | 72.84 | 65.05 |
| CosmoFlow | AU_5 | 71.89 | 72.28 | 73.38 | 73.74 | 72.72 | 73.57 | 70.45 | 76.80 | 73.24 | 87.94 | 72.72 | 64.81 |
| CosmoFlow | Throughput (MiB/s) | 15461.26 | 15499.75 | 15757.37 | 15838.27 | 15657.91 | 15848.42 | 15165.49 | 8270.50 | 3933.09 | 2023.81 | 1121.04 | INVALID |
Table 4. MLPerf Storage results for the Solidigm D7-PS1010, D5-P5336, and D3-S4520 on the D54U-3U (column headers give the number of devices).
For ResNet50, AU exceeds 90% across all configurations, meeting its criterion, with throughput peaking at 20,069.97 MiB/s (D7-PS1010, 8 devices) and 19,817.54 MiB/s (D5-P5336, 8 devices), while D3-S4520 reaches only 4,989.15 MiB/s, indicating its inadequacy for high-throughput demands. UNet-3D also consistently achieves AU above 90%, with D7-PS1010 and D5-P5336 delivering exceptional throughputs of 23,176.57 MiB/s and 23,045.24 MiB/s at 8 devices, respectively, underscoring NVMe’s superiority over D3-S4520, which shows invalid results for lower device counts. CosmoFlow, with an AU criterion of 70%, maintains AU values above this threshold, but its throughput shows minimal scaling (e.g., 15,838.27 MiB/s for D7-PS1010 and 15,657.91 MiB/s for D5-P5336 at 8 devices), reflecting the workload’s inherent characteristics. This suggests that CosmoFlow’s data access patterns and computational demands are less sensitive to storage scaling, prioritizing other system factors like compute and memory efficiency.
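To make the pass/fail logic concrete, the sketch below checks whether a configuration's five AU measurements all clear the workload's AU criterion (90% for ResNet-50 and Unet3D, 70% for CosmoFlow, as stated above). It assumes, consistent with the five-run replicability rule described earlier, that every run must meet the threshold; the example values are the D7-PS1010 8-disk ResNet-50 AU measurements from Table 4.

```python
AU_CRITERIA = {"resnet50": 90.0, "unet3d": 90.0, "cosmoflow": 70.0}

def run_is_valid(workload: str, au_samples: list[float]) -> bool:
    """A configuration passes only if every measured run meets the AU threshold."""
    threshold = AU_CRITERIA[workload]
    return all(au >= threshold for au in au_samples)

# AU_1..AU_5 for the D7-PS1010 with 8 disks running ResNet-50 (Table 4).
resnet_au = [90.30, 90.26, 90.80, 90.36, 90.59]
print(run_is_valid("resnet50", resnet_au))   # True: all five runs clear 90%
```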
The simulated accelerators, optimized through iterative testing, reflect the best configuration for each setup, balancing AU and throughput. Solidigm D7-PS1010 generally outperforms D5-P5336 with fewer devices due to its PCIe Gen5 bandwidth, while D5-P5336 requires more scaling to match performance. D3-S4520 consistently underperforms, reinforcing NVMe’s necessity for AI workloads. These results highlight the importance of workload-specific storage planning, with NVMe SSDs critical for high-throughput models like UNet-3D, while CosmoFlow’s stability indicates less reliance on storage scaling.
Overall, the data confirms that NVMe SSDs are necessary for AI workloads, particularly for models like Unet3D that demand high throughput. The D7-PS1010 reaches its peak performance with fewer disks, while the D5-P5336 requires additional scaling to match those performance levels. The D3-S4520 is unsuitable for these tasks, emphasizing the need for careful storage selection in AI infrastructure planning.
Based on the analysis, the following recommendations are provided for optimizing AI model training:

- Use high-speed NVMe SSDs for training, especially for data-intensive models such as DLRMv2; PCIe Gen5 drives like the D7-PS1010 reach peak performance with fewer devices than PCIe Gen4 drives like the D5-P5336.
- Scale the number of data disks to the workload rather than by default, since training gains plateau once storage is no longer the bottleneck.
- Enable GPUDirect Storage on platforms that support it to reduce data-transfer latency between NVMe SSDs and GPUs.
- For inference, prioritize GPU capability and memory bandwidth over additional storage, since disk count showed no measurable impact on inference performance.
- Avoid SATA SSDs such as the D3-S4520 for AI data pipelines; their throughput is insufficient for modern training and storage workloads.
By following these recommendations, AI model training can be optimized for performance and efficiency, ensuring that hardware investments are effectively utilized to meet computational demands.
The benchmarking results from MLPerf v4.1 highlight the critical role of high-performance storage in AI workloads, particularly in training scenarios. The evaluation of Solidigm NVMe SSDs—D7-PS1010 (PCIe Gen5), D5-P5336 (PCIe Gen4), and D3-S4520 (SATA)—demonstrates that while inference performance remains largely unaffected by storage configurations, AI training and storage-intensive workloads benefit significantly from high-speed NVMe solutions.
For inference workloads, the results confirm that disk count has no measurable impact, as models are preloaded into memory before execution. Consequently, optimizing GPU performance and memory bandwidth is more critical for improving inference efficiency than scaling storage solutions.
In contrast, training workloads, particularly for storage-intensive models such as DLRMv2, exhibit clear performance improvements when using high-speed NVMe SSDs. The Solidigm D7-PS1010 PCIe Gen5 SSD consistently delivers better training times compared to the D5-P5336 PCIe Gen4 SSD, particularly in low-disk configurations. However, as the number of disks increases, performance gains begin to plateau, indicating a threshold beyond which additional storage scaling offers diminishing returns.
The MLPerf Storage benchmark results further emphasize the necessity of NVMe SSDs for AI applications. The performance of the SATA SSD (Solidigm D3-S4520) is insufficient for modern AI workloads, making NVMe storage the preferred option. While D7-PS1010 achieves peak efficiency with fewer disks, D5-P5336 requires additional scaling to match performance levels, underscoring the importance of workload-specific storage planning.
Overall, these findings highlight that AI infrastructure optimization requires a balanced approach, where training workloads demand high-performance NVMe SSDs, while inference workloads benefit more from GPU and memory enhancements. Organizations aiming to scale AI deployments should prioritize storage solutions based on workload requirements, ensuring an optimal balance between compute power, memory bandwidth, and storage performance to maximize efficiency and scalability.
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” May 24, 2019, arXiv: arXiv:1810.04805. doi: 10.48550/arXiv.1810.04805.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Dec. 10, 2015, arXiv: arXiv:1512.03385. doi: 10.48550/arXiv.1512.03385.
[3] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” Feb. 07, 2018, arXiv: arXiv:1708.02002. doi: 10.48550/arXiv.1708.02002.
[4] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” Jun. 21, 2016, arXiv: arXiv:1606.06650. doi: 10.48550/arXiv.1606.06650.
[5] M. Naumov et al., “Deep Learning Recommendation Model for Personalization and Recommendation Systems,” May 31, 2019, arXiv: arXiv:1906.00091. doi: 10.48550/arXiv.1906.00091.
[6] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 billion parameter autoregressive language model.” May 2021. [Online]. Available: https://github.com/kingoflolz/mesh-transformer-jax
[7] H. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” Jul. 19, 2023, arXiv: arXiv:2307.09288. doi: 10.48550/arXiv.2307.09288.
[8] A. Q. Jiang et al., “Mixtral of Experts,” Jan. 08, 2024, arXiv: arXiv:2401.04088. doi: 10.48550/arXiv.2401.04088.
[9] D. Podell et al., “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis,” Jul. 04, 2023, arXiv: arXiv:2307.01952. doi: 10.48550/arXiv.2307.01952.
[10] M. Chen, Y. Zhang, X. Kou, Y. Li, and Y. Zhang, “r-GAT: Relational Graph Attention Network for Multi-Relational Graphs,” Sep. 13, 2021, arXiv: arXiv:2109.05922. doi: 10.48550/arXiv.2109.05922.
[11] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685.
[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” Apr. 13, 2022, arXiv: arXiv:2112.10752. doi: 10.48550/arXiv.2112.10752.
[13] W. Liu et al., “SSD: Single Shot MultiBox Detector,” vol. 9905, 2016, pp. 21–37. doi: 10.1007/978-3-319-46448-0_2.
[14] H. Devarajan, H. Zheng, A. Kougkas, X.-H. Sun, and V. Vishwanath, “DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications,” in 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 2021, pp. 81–91. doi: 10.1109/CCGrid51090.2021.00018.
[15] J. Zhou et al., “Graph Neural Networks: A Review of Methods and Applications,” Oct. 06, 2021, arXiv: arXiv:1812.08434. doi: 10.48550/arXiv.1812.08434.
©2025, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.
Other names and brands may be claimed as the property of others.
Solidigm may make changes to specifications and product descriptions at any time, without notice.
Tests document the performance of components on a particular test, in specific systems.
Differences in hardware, software, or configuration will affect actual performance.
Consult other sources of information to evaluate performance as you consider your purchase.
These results are preliminary and provided for information purposes only. These values and claims are neither final nor official.
Drives are considered engineering samples. Refer to roadmap for production guidance.