Accelerating AI With High Performance Storage

Solidigm™ SSDs with the NVIDIA® Magnum IO architecture, which includes GPUDirect® Storage and NVIDIA BlueField®-3 data processing unit (DPU)-driven NVMe emulation

Performance of Solidigm SSDs with NVIDIA GDS vs CPU-GPU data path

The GPU and storage bottleneck

Modern AI workloads demand unprecedented data throughput and low-latency access to massive datasets. Traditional storage architectures, reliant on CPU-facilitated data movement between NVMe SSDs and GPUs, struggle to keep pace with GPU compute capabilities. Data center SSDs, like the Solidigm™ D7-PS1010, deliver up to 14,500 MB/s sequential read speeds, but unlocking their full potential requires rethinking how GPUs interact with storage both locally and across distributed remote systems.

NVIDIA GPUDirect Storage (GDS)

NVIDIA GPUDirect Storage (GDS) eliminates the CPU bottleneck by enabling direct memory access (DMA) between GPUs and NVMe SSDs. As part of the NVIDIA Magnum IO SDK, GDS integrates with frameworks like CUDA to bypass CPU/RAM data staging, reducing latency and freeing CPU resources for critical management tasks.
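Before running GDS workloads, the installation can be sanity-checked with the gdscheck utility that ships with the GDS tools in the CUDA toolkit. The sketch below is illustrative only; the tool path assumes a default CUDA installation, and on some installs the tool is a compiled binary named gdscheck rather than gdscheck.py.

  # Confirm GDS driver, filesystem, and platform support (path assumes a default CUDA install)
  /usr/local/cuda/gds/tools/gdscheck.py -p

  # GDS/cuFile behavior is configured here and can be tuned per deployment
  cat /etc/cufile.json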

Extending GDS to remote storage with NVIDIA DPUs

While GDS optimizes local storage access, modern AI infrastructure demands scalable solutions that decouple storage from individual GPU nodes. NVIDIA Data Processing Units (DPUs) bridge this gap by offloading storage and networking tasks, enabling remote NVMe over Fabrics (NVMe-oF) emulation1 using the DPU's SNAP framework. Solidigm PCIe Gen5 SSDs can be virtualized as remote drives over high-speed fabrics, allowing GPUs to access distributed storage pools. This architecture combines GDS's direct data paths with DPU-driven fabric scalability, delivering a unified solution for AI workloads.

System configuration

Hardware

  1. Server: Supermicro ARS-111GL-NHR 
  2. CPU/GPU: NVIDIA GH200 Grace Hopper Superchip
  3. SSDs: Solidigm™ D7-PS1010 (E1.S, 7.68TB, PCIe 5.0) & Solidigm™ D7-P5520 (U.2, 7.68TB, PCIe 4.0)

Software

  1. OS: Ubuntu 22.04.5 LTS 
  2. Kernel: 6.8.0-1021-nvidia-64k
  3. CUDA: 12.6
  4. GDSIO: 1.11

Methodology

Two data paths are compared:

1. GDS path: Direct DMA transfers between GPU and SSD (example gdsio invocations for both paths are sketched below).

2. Traditional path: Data moves from SSD → CPU/RAM → GPU.

Figure 1. GDS path with direct DMA transfers between GPU and SSD
Figure 2. Traditional path without GDS (data staged through CPU/RAM)
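Both paths can be exercised with the same gdsio tool by changing only the transfer-type flag. The sketch below assumes gdsio transfer type 0 selects the GDS (storage → GPU) path and transfer type 2 selects the storage → CPU → GPU path; the full command set used for the results is listed in the appendix.

  # GDS path: data moves storage -> GPU via DMA (transfer type 0)
  gdsio -T 45 -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 128k

  # Traditional path: data is staged through CPU/RAM before reaching the GPU (transfer type 2, assumed mapping)
  gdsio -T 45 -s 512M -I 0 -x 2 -D /mnt -w 32 -d 0 -i 128k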

Benchmark parameters

  • Block Sizes: 64KiB, 128KiB, 512KiB, 1024KiB, 4096KiB 
  • Workloads: Sequential read
  • Queue Depth (QD): 24/32  
  • Metrics: Throughput (GiBps), CPU USR utilization (%), server power consumption (Watts)
  • Runtime: 45 seconds
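The parameters above can be swept with a short script. The sketch below is one plausible way to run the sweep and sample CPU utilization with mpstat from the sysstat package; it is illustrative, and the exact collection method behind the CPU_USR numbers in the tables is not prescribed here.

  # Small-block runs at QD 32; mpstat samples %usr nine times at 5 s intervals (~45 s)
  for bs in 64k 128k; do
      mpstat 5 9 > cpu_usr_${bs}.log &
      gdsio -T 45 -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i ${bs}
      wait
  done

  # Large-block runs at QD 24
  for bs in 512k 1024k 4096k; do
      mpstat 5 9 > cpu_usr_${bs}.log &
      gdsio -T 45 -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i ${bs}
      wait
  done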

Results

Drive: D7-P5520 7.68TB (PCIe Gen4)

             GDS Path                              CPU-GPU (Traditional Path)
IO Size      Throughput (GiBps)   CPU_USR (%)      Throughput (GiBps)   CPU_USR (%)
64KiB        4.35                 0.14             4.30                 0.92
128KiB       5.21                 0.08             5.18                 0.56
512KiB       6.50                 0.03             6.51                 0.20
1024KiB      6.59                 0.02             6.64                 0.12
4096KiB      6.62                 0.01             6.63                 0.06

Table 1. Solidigm D7-P5520 FW: 9CV10330 (U.2, 7.68TB, PCIe 4.0)

Drive: D7-PS1010 7.68TB (PCIe Gen5)

             GDS Path                              CPU-GPU (Traditional Path)
IO Size      Throughput (GiBps)   CPU_USR (%)      Throughput (GiBps)   CPU_USR (%)
64KiB        12.38                0.51             12.70                3.15
128KiB       13.20                0.27             13.48                1.64
512KiB       13.41                0.04             13.48                0.46
1024KiB      13.48                0.02             13.48                0.29
4096KiB      13.48                0.01             13.48                0.14

Table 2. Solidigm D7-PS1010 FW: G77YG100 (E1.S, 7.68TB, PCIe 5.0)

Figure 3. Solidigm D7-P5520 throughput for NVIDIA GDS vs. CPU-GPU
Figure 4. Solidigm D7-P5520 utilization for NVIDIA GDS vs. CPU-GPU
Figure 5. Solidigm D7-PS1010 throughput for NVIDIA GDS vs. CPU-GPU
Figure 6. Solidigm D7-PS1010 utilization for NVIDIA GDS vs. CPU-GPU
Figure 7. Average server power consumption over 100 cycles (Watts) for NVIDIA GDS vs. CPU-GPU

Key takeaway and analysis

  1. GDS consistently delivers throughput comparable to the traditional CPU-GPU path across all block sizes.
  2. GDS reduces CPU utilization, freeing cores for application tasks by avoiding redundant CPU-mediated copies through system RAM.
  3. GDS workloads draw about 7 Watts less server power than the equivalent CPU-GPU workloads, and this result is consistent across multiple runs (100 cycles).

Remote storage performance with NVIDIA Magnum IO architecture

This section demonstrates Solidigm SSD performance in the NVIDIA Magnum IO architecture, which includes NVIDIA Magnum IO GPUDirect Storage and NVIDIA NVMe SNAP.1

NVIDIA DPU (Data processing unit)

A DPU is a specialized processor designed to offload infrastructure tasks (networking, storage, security) from CPUs. NVIDIA BlueField DPUs combine multi-core Arm CPUs, high-speed networking, and hardware accelerators to optimize data center efficiency.

SNAP (Software-defined Network Accelerated Processing)

SNAP is a DPU-accelerated framework that virtualizes remote SSDs as local NVMe drives. Running in containers on NVIDIA DPUs, SNAP translates local NVMe commands into NVMe-oF protocol packets, enabling direct RDMA transfers between remote storage and GPU memory. 

NVMe over Fabrics (NVMe-oF)

NVMe-oF extends the NVMe protocol to access remote storage devices over networks such as InfiniBand. This enables shared storage pools, scalable resource allocation, and allows GPUs and servers to treat high-performance SSDs as if they were locally attached.
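For illustration, a generic Linux initiator can discover and attach an NVMe-oF subsystem with the standard nvme-cli tool, after which the remote namespace appears as a local /dev/nvmeXnY block device. The address and NQN below are placeholders; in the tested configuration this attach step is handled by the DPU's SNAP emulation rather than by the host.

  # Discover subsystems exported over RDMA by a target at a placeholder address
  nvme discover -t rdma -a 192.168.10.2 -s 4420

  # Connect to a placeholder subsystem NQN; the namespace then shows up in 'nvme list'
  nvme connect -t rdma -a 192.168.10.2 -s 4420 -n nqn.2024-01.io.example:remote-ps1010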

Figure 8. NVIDIA Magnum IO architecture with Solidigm SSDs

End-to-end workflow

  1. Host server initiates sequential read via GDS 
    The GPU server triggers a sequential read operation using the `gdsio` benchmarking tool, which is designed to leverage NVIDIA GPUDirect Storage (GDS). This tool bypasses the CPU and system memory entirely, issuing native NVMe read commands directly from the GPU’s memory space to the DPU-emulated1 NVMe drive.  
  2. DPU intercepts and translates NVMe commands
    The DPU, acting as the controller for the emulated1 NVMe drive, intercepts the NVMe read commands. Using its integrated SNAP framework, the DPU translates these commands into NVMe-oF protocol packets. This translation preserves the semantics of local NVMe operations while adapting them for remote storage access over the network.
  3. RDMA transfer over fabric
    The translated NVMe-oF commands are transmitted over a high-speed InfiniBand RDMA fabric, which connects the GPU server to the remote storage server housing the physical Solidigm PCIe Gen5 SSD. Data flows from the remote SSD directly into the GPU’s memory buffer, with no intermediate staging in host memory.  
  4. Direct GPU memory placement
    The DPU's SNAP framework ensures the retrieved data is placed directly into the GPU's memory space via RDMA, completing the read operation. This end-to-end path eliminates CPU involvement, maintaining near-local latency and maximizing throughput.
  5. SNAP queue
    In SNAP (Software-defined Network Accelerated Processing), queues enable parallel processing of I/O operations, enhancing throughput and reducing latency. Using 32 queues instead of one distributes the load across multiple cores, preventing bottlenecks and improving performance. This is crucial for handling high-traffic applications efficiently, ensuring faster response times and scalability. A host-side sketch of preparing and exercising the emulated drive follows this list.
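From the GPU server's perspective, the SNAP-emulated device behaves like any other NVMe drive. The sketch below shows one way to prepare and exercise it; the device name and mount point are placeholders, and GDS requires a supported local filesystem such as ext4.

  # The emulated drive enumerates as a regular NVMe namespace on the GPU server
  nvme list

  # Create a GDS-supported filesystem and mount it (device name is a placeholder)
  mkfs.ext4 /dev/nvme1n1
  mount /dev/nvme1n1 /mnt

  # Run the same gdsio sequential-read workload against the remote (emulated) drive
  gdsio -T 45 -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 1024k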

Performance benchmarks: Remote vs. local storage

System configuration for remote setup

Storage server

  1. Server: Supermicro AS1115C-TNR
  2. CPU: AMD EPYC 9124 (PCIe 5.0)
  3. DPU/NIC: NVIDIA BlueField-3 B3140 DPU
  4. SSD:  
    Solidigm D7-PS1010 (E1.S, 7.68 TB, PCIe 5.0)
    Solidigm D7-P5520 (U.2, 7.68TB, PCIe 4.0)
  5. OS: Ubuntu 20.04.6 LTS 
  6. Kernel: 5.4.0-205-generic

Compute server

  1. Server: Supermicro ARS-111GL-NHR 
  2. CPU/GPU: NVIDIA GH200 Grace Hopper Superchip
  3. DPU: NVIDIA BlueField-3 B3240 DPU
  4. OS: Ubuntu 22.04.5 LTS 
  5. Kernel: 6.8.0-1021-nvidia-64k
  6. CUDA: 12.6
  7. GDSIO: 1.11

Methodology

Two data paths are compared:

  1. Local Storage: Direct-attached SSD accessed via GDS.  
  2. Remote Storage: DPU-emulated1 NVMe-oF drive (SSD over InfiniBand) via GDS.

Benchmark parameters

  • Block Sizes: 64KiB, 128KiB, 512KiB, 1024KiB, 4096KiB 
  • SNAP Queue: 1, 7, 15, 23, 31
  • Workloads: Sequential Read
  • Queue Depth (QD): 24/32  
  • Metrics: Throughput (GiBps)
  • Runtime: 45 seconds

Results

             PCIe 4.0 – Solidigm D7-P5520 7.68TB          PCIe 5.0 – Solidigm D7-PS1010 7.68TB
             FW: 9CV10330                                  FW: G77YG100
IO Size      Direct Setup (GiBps)   Remote Setup (GiBps)   Direct Setup (GiBps)   Remote Setup (GiBps)
64KiB        4.42                   4.14                   12.38                  10.42
128KiB       5.27                   5.07                   13.20                  13.16
512KiB       6.50                   6.45                   13.41                  13.50
1024KiB      6.58                   6.70                   13.48                  13.85
4096KiB      6.46                   6.50                   13.48                  13.85

Table 3. PCIe 4.0 vs. PCIe 5.0 results

Impact of SNAP queue on Solidigm D7-PS1010 E1.S 7.68TB 

             SNAP Queue 1   SNAP Queue 7   SNAP Queue 15   SNAP Queue 23   SNAP Queue 31
IO Size      (GiBps)        (GiBps)        (GiBps)         (GiBps)         (GiBps)
64KiB        6.77           8.06           9.48            9.78            10.68
128KiB       9.18           11.1           12.68           12.73           12.93
512KiB       9.44           11.15          12.53           13.06           13.09
1024KiB      9.56           12.25          12.59           13.15           13.34
4096KiB      10.57          12.56          13.48           13.67           13.73

Table 4. Impact of SNAP queue

Figure 9. Solidigm D7-P5520 throughput for direct setup vs. remote setup
Figure 10. Solidigm D7-PS1010 throughput for direct setup vs. remote setup
Figure 11. Impact of SNAP queue count on throughput

Key takeaway and analysis

1. Throughput parity

The NVIDIA BlueField DPU's protocol offloading and RDMA minimize fabric overhead, enabling near-local throughput. As the graphs show for both PCIe Gen4 and PCIe Gen5, remote-storage throughput is comparable to the local (direct-attached) setup.

2. Block size impact

As block size increases, remote-setup throughput slightly exceeds the local setup, while at smaller block sizes remote throughput is slightly lower, because per-I/O fabric overhead weighs more heavily on small transfers.

3. SNAP queue impact

The SNAP queue count is another important factor when enabling a remote storage setup. By increasing the number of SNAP queues to 32, we see higher throughput because more I/O requests are handled simultaneously, reducing potential bottlenecks, as shown in the graph for the various queue counts.

Conclusion and future directions

This white paper demonstrates that NVIDIA GPUDirect Storage, combined with Solidigm PCIe Gen5 SSDs and DPU-driven NVMe-oF emulation,1 enables remote storage performance at parity with local NVMe drives. By eliminating CPU bottlenecks and leveraging RDMA over high-speed InfiniBand fabrics, the architecture unlocks scalable, cost-efficient AI infrastructure without compromising acceleration, demonstrating that centralized storage pools can replace local drives while maintaining GPU workload efficiency.

The implications for AI infrastructure

Organizations can reduce hardware sprawl and operational costs by decoupling storage from compute nodes, while energy savings from DPU offloading and streamlined data paths support sustainable scaling. This approach is particularly impactful for distributed training workflows and edge inferencing deployments, where low-latency access to shared datasets is critical.  

Looking ahead, advancements in 800G networking, DPU-accelerated computational storage, and deeper integration with Kubernetes and ML frameworks can further solidify this architecture as the foundation for next-gen AI data centers.  

Recommendations

Organizations can consider adopting GDS and DPUs to future-proof their AI infrastructure, pairing Solidigm PCIe Gen4 and/or Gen5 SSDs for bulk data workloads with RDMA-enabled fabrics to minimize latency. This unified architecture can help enterprises scale GPU resources while maintaining performance and cost efficiency.

 


About the Authors

Ashwin Pai is a System Validation Engineer at Solidigm, with nearly a decade of experience in software, hardware, and systems engineering. He focuses on validating next-generation SSD technologies across diverse platforms, including those optimized for AI and data-intensive workloads. Ashwin collaborates across cross-functional teams utilizing advanced AI methodologies and breakthrough innovations to enhance the capabilities of Solidigm SSDs in AI-driven environments. He holds a Bachelor of Engineering in Electronics from VES Institute of Technology and an M.S. in Computer Engineering from North Carolina State University.

Akhil Srinivas is an Electrical & Systems Engineer at Solidigm. He collaborates with industry-leading ecosystem vendors to validate Solidigm SSDs for cutting-edge storage solutions. He leverages emerging AI technologies and pathfinding innovations to position Solidigm SSDs as critical components in next-generation platforms, strengthening partnerships in the AI space. Beyond the enterprise, he indulges in culinary adventures, exploring popular food trucks and restaurants across the country. Akhil holds a Bachelor of Telecommunications Engineering from R.V. College of Engineering and an M.S. in Electrical and Computer Engineering from University of California, Davis.

Appendix

1. We have referred to the following links for GDS and DOCA setup and installation.

2. Workloads executed

  • gdsio -T 45  -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 64k 
  • gdsio -T 45  -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 128k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 512k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 1024k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 4096k

-T specifies the duration of the test in seconds

-s sets the size of the dataset

-I selects the I/O type, where 0 is a sequential read

-x selects the transfer type, where 0 is the GDS (storage → GPU) path

-D sets the directory path where the test files will be stored

-w specifies the number of worker threads to be used during the test

-d indicates the GPU device ID to be used

-i sets the I/O size

3. We captured the server power consumption using the server management console.
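On platforms that expose DCMI over IPMI, one illustrative alternative is to sample power with ipmitool; this is an option shown for reference rather than the method used for the figures above.

  # Read the current chassis power draw from the BMC (requires DCMI support and BMC access)
  ipmitool dcmi power reading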

Notes

1. In NVIDIA DOCA SNAP, emulation refers to creating a software-based NVMe device that behaves like real hardware to the host system.

Disclaimers

©2025, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.

Other names and brands may be claimed as the property of others. 

Solidigm may make changes to specifications and product descriptions at any time, without notice. 

Tests document the performance of components on a particular test, in specific systems. 

Differences in hardware, software, or configuration will affect actual performance.

Consult other sources of information to evaluate performance as you consider your purchase. 

These results are preliminary and provided for information purposes only. These values and claims are neither final nor official. 

Drives are considered engineering samples. Refer to roadmap for production guidance.