Accelerating AI With High Performance Storage

Solidigm™ SSDs with the NVIDIA® Magnum IO architecture, which includes GPUDirect® Storage and NVIDIA BlueField®-3 data processing unit (DPU)-driven NVMe emulation

Performance of Solidigm SSDs with NVIDIA GDS vs CPU-GPU data path

The GPU and storage bottleneck

Modern AI workloads demand unprecedented data throughput and low-latency access to massive datasets. Traditional storage architectures, reliant on CPU-facilitated data movement between NVMe SSDs and GPUs, struggle to keep pace with GPU compute capabilities. Data center SSDs, like the Solidigm™ D7-PS1010, deliver up to 14,500 MB/s sequential read speeds, but unlocking their full potential requires rethinking how GPUs interact with storage both locally and across distributed remote systems.

NVIDIA GPUDirect Storage (GDS)

NVIDIA GPUDirect Storage (GDS) eliminates the CPU bottleneck by enabling direct memory access (DMA) between GPUs and NVMe SSDs. As part of the NVIDIA Magnum IO SDK, GDS integrates with frameworks like CUDA to bypass CPU/RAM data staging, reducing latency and freeing CPU resources for critical management tasks.
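Before running GDS workloads, the installation can be sanity-checked with the gdscheck utility that ships with the GDS tools in the CUDA toolkit. The sketch below is illustrative only; the tool path assumes a default CUDA installation, and on some installs the tool is a compiled binary named gdscheck rather than gdscheck.py.

  # Confirm GDS driver, filesystem, and platform support (path assumes a default CUDA install)
  /usr/local/cuda/gds/tools/gdscheck.py -p

  # GDS/cuFile behavior is configured here and can be tuned per deployment
  cat /etc/cufile.json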

Extending GDS to remote storage with NVIDIA DPUs

While GDS optimizes local storage access, modern AI infrastructure demands scalable solutions that decouple storage from individual GPU nodes. NVIDIA Data Processing Units (DPUs) bridge this gap by offloading storage and networking tasks, enabling remote NVMe over Fabrics (NVMe-oF) emulation1 using the DPU's SNAP framework. Solidigm PCIe Gen5 SSDs can be virtualized as remote drives over high-speed fabrics, allowing GPUs to access distributed storage pools. This architecture combines GDS's direct data paths with DPU-driven fabric scalability, delivering a unified solution for AI workloads.

System configuration

Hardware

  1. Server: Supermicro ARS-111GL-NHR 
  2. CPU/GPU: NVIDIA GH200 Grace Hopper Superchip
  3. SSDs: Solidigm™ D7-PS1010 (E1.S, 7.68TB, PCIe 5.0) & Solidigm™ D7-P5520 (U.2, 7.68TB, PCIe 4.0)

Software

  1. OS: Ubuntu 22.04.5 LTS 
  2. Kernel: 6.8.0-1021-nvidia-64k
  3. CUDA: 12.6
  4. GDSIO: 1.11

Methodology

Two data paths are compared:

1. GDS path: Direct DMA transfers between GPU and SSD (example gdsio invocations for both paths are sketched below).

2. Traditional path: Data moves from SSD → CPU/RAM → GPU.

Figure 1. GDS path with direct DMA transfers between GPU and SSD
Figure 2. Traditional path without GDS (data staged through CPU/RAM)
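Both paths can be exercised with the same gdsio tool by changing only the transfer-type flag. The sketch below assumes gdsio transfer type 0 selects the GDS (storage → GPU) path and transfer type 2 selects the storage → CPU → GPU path; the full command set used for the results is listed in the appendix.

  # GDS path: data moves storage -> GPU via DMA (transfer type 0)
  gdsio -T 45 -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 128k

  # Traditional path: data is staged through CPU/RAM before reaching the GPU (transfer type 2, assumed mapping)
  gdsio -T 45 -s 512M -I 0 -x 2 -D /mnt -w 32 -d 0 -i 128k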

Benchmark parameters

  • Block Sizes: 64KiB, 128KiB, 512KiB, 1024KiB, 4096KiB 
  • Workloads: Sequential read
  • Queue Depth (QD): 24/32  
  • Metrics: Throughput (GiBps), CPU USR utilization (%), server power consumption (Watts)
  • Runtime: 45 seconds
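The parameters above can be swept with a short script. The sketch below is one plausible way to run the sweep and sample CPU utilization with mpstat from the sysstat package; it is illustrative, and the exact collection method behind the CPU_USR numbers in the tables is not prescribed here.

  # Small-block runs at QD 32; mpstat samples %usr nine times at 5 s intervals (~45 s)
  for bs in 64k 128k; do
      mpstat 5 9 > cpu_usr_${bs}.log &
      gdsio -T 45 -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i ${bs}
      wait
  done

  # Large-block runs at QD 24
  for bs in 512k 1024k 4096k; do
      mpstat 5 9 > cpu_usr_${bs}.log &
      gdsio -T 45 -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i ${bs}
      wait
  done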

Results

Drive: D7-P5520 7.68TB (PCIe Gen4)

             GDS Path                              CPU-GPU (Traditional Path)
IO Size      Throughput (GiBps)   CPU_USR (%)      Throughput (GiBps)   CPU_USR (%)
64KiB        4.35                 0.14             4.30                 0.92
128KiB       5.21                 0.08             5.18                 0.56
512KiB       6.50                 0.03             6.51                 0.20
1024KiB      6.59                 0.02             6.64                 0.12
4096KiB      6.62                 0.01             6.63                 0.06

Table 1. Solidigm D7-P5520 FW: 9CV10330 (U.2, 7.68TB, PCIe 4.0)

Drive: D7-PS1010 7.68TB (PCIe Gen5)

             GDS Path                              CPU-GPU (Traditional Path)
IO Size      Throughput (GiBps)   CPU_USR (%)      Throughput (GiBps)   CPU_USR (%)
64KiB        12.38                0.51             12.70                3.15
128KiB       13.20                0.27             13.48                1.64
512KiB       13.41                0.04             13.48                0.46
1024KiB      13.48                0.02             13.48                0.29
4096KiB      13.48                0.01             13.48                0.14

Table 2. Solidigm D7-PS1010 FW: G77YG100 (E1.S, 7.68TB, PCIe 5.0)

Figure 3. Solidigm D7-P5520 throughput for NVIDIA GDS vs. CPU-GPU
Figure 4. Solidigm D7-P5520 utilization for NVIDIA GDS vs. CPU-GPU
Figure 5. Solidigm D7-PS1010 throughput for NVIDIA GDS vs. CPU-GPU
Figure 6. Solidigm D7-PS1010 utilization for NVIDIA GDS vs. CPU-GPU
Figure 7. Average server power consumption over 100 cycles (Watts) for NVIDIA GDS vs. CPU-GPU

Key takeaway and analysis

  1. GDS consistently delivers throughput comparable to the traditional CPU-GPU path across all block sizes.
  2. GDS reduces CPU utilization, freeing cores for application tasks by avoiding redundant CPU-mediated copies through system RAM.
  3. GDS workloads draw about 7 Watts less server power than the equivalent CPU-GPU workloads, and this result is consistent across multiple runs (100 cycles).

Remote storage performance with NVIDIA Magnum IO architecture

This section demonstrates Solidigm SSD performance in the NVIDIA Magnum IO architecture, which includes NVIDIA Magnum IO GPUDirect Storage and NVIDIA NVMe SNAP.1

NVIDIA DPU (Data processing unit)

A DPU is a specialized processor designed to offload infrastructure tasks (networking, storage, security) from CPUs. NVIDIA BlueField DPUs combine multi-core Arm CPUs, high-speed networking, and hardware accelerators to optimize data center efficiency.

SNAP (Software-defined Network Accelerated Processing)

SNAP is a DPU-accelerated framework that virtualizes remote SSDs as local NVMe drives. Running in containers on NVIDIA DPUs, SNAP translates local NVMe commands into NVMe-oF protocol packets, enabling direct RDMA transfers between remote storage and GPU memory. 

NVMe over Fabrics (NVMe-oF)

NVMe-oF extends the NVMe protocol to access remote storage devices over networks such as InfiniBand. This enables shared storage pools, scalable resource allocation, and allows GPUs and servers to treat high-performance SSDs as if they were locally attached.
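For illustration, a generic Linux initiator can discover and attach an NVMe-oF subsystem with the standard nvme-cli tool, after which the remote namespace appears as a local /dev/nvmeXnY block device. The address and NQN below are placeholders; in the tested configuration this attach step is handled by the DPU's SNAP emulation rather than by the host.

  # Discover subsystems exported over RDMA by a target at a placeholder address
  nvme discover -t rdma -a 192.168.10.2 -s 4420

  # Connect to a placeholder subsystem NQN; the namespace then shows up in 'nvme list'
  nvme connect -t rdma -a 192.168.10.2 -s 4420 -n nqn.2024-01.io.example:remote-ps1010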

Figure 8. NVIDIA Magnum IO architecture with Solidigm SSDs

End-to-end workflow

  1. Host server initiates sequential read via GDS 
    The GPU server triggers a sequential read operation using the `gdsio` benchmarking tool, which is designed to leverage NVIDIA GPUDirect Storage (GDS). This tool bypasses the CPU and system memory entirely, issuing native NVMe read commands directly from the GPU’s memory space to the DPU-emulated1 NVMe drive.  
  2. DPU intercepts and translates NVMe commands
    The DPU, acting as the controller for the emulated1 NVMe drive, intercepts the NVMe read commands. Using its integrated SNAP framework, the DPU translates these commands into NVMe-oF protocol packets. This translation preserves the semantics of local NVMe operations while adapting them for remote storage access over the network.
  3. RDMA transfer over fabric
    The translated NVMe-oF commands are transmitted over a high-speed InfiniBand RDMA fabric, which connects the GPU server to the remote storage server housing the physical Solidigm PCIe Gen5 SSD. Data flows from the remote SSD directly into the GPU’s memory buffer, with no intermediate staging in host memory.  
  4. Direct GPU memory placement
    The DPU's SNAP framework ensures the retrieved data is placed directly into the GPU's memory space via RDMA, completing the read operation. This end-to-end path eliminates CPU involvement, maintaining near-local latency and maximizing throughput.
  5. SNAP queue
    In SNAP (Software-defined Network Accelerated Processing), queues enable parallel processing of I/O operations, enhancing throughput and reducing latency. Using 32 queues instead of one distributes the load across multiple cores, preventing bottlenecks and improving performance. This is crucial for handling high-traffic applications efficiently, ensuring faster response times and scalability. A host-side sketch of preparing and exercising the emulated drive follows this list.
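From the GPU server's perspective, the SNAP-emulated device behaves like any other NVMe drive. The sketch below shows one way to prepare and exercise it; the device name and mount point are placeholders, and GDS requires a supported local filesystem such as ext4.

  # The emulated drive enumerates as a regular NVMe namespace on the GPU server
  nvme list

  # Create a GDS-supported filesystem and mount it (device name is a placeholder)
  mkfs.ext4 /dev/nvme1n1
  mount /dev/nvme1n1 /mnt

  # Run the same gdsio sequential-read workload against the remote (emulated) drive
  gdsio -T 45 -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 1024k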

Performance benchmarks: Remote vs. local storage

System configuration for remote setup

Storage server

  1. Server: Supermicro AS1115C-TNR
  2. CPU: AMD EPYC 9124 (PCIe 5.0)
  3. DPU/NIC: NVIDIA BlueField-3 B3140 DPU
  4. SSD:  
    Solidigm D7-PS1010 (E1.S, 7.68 TB, PCIe 5.0)
    Solidigm D7-P5520 (U.2, 7.68TB, PCIe 4.0)
  5. OS: Ubuntu 20.04.6 LTS 
  6. Kernel: 5.4.0-205-generic

Compute server

  1. Server: Supermicro ARS-111GL-NHR 
  2. CPU/GPU: NVIDIA GH200 Grace Hopper Superchip
  3. DPU: NVIDIA BlueField-3 B3240 DPU
  4. OS: Ubuntu 22.04.5 LTS 
  5. Kernel: 6.8.0-1021-nvidia-64k
  6. CUDA: 12.6
  7. GDSIO: 1.11

Methodology

Two data paths are compared:

  1. Local Storage: Direct-attached SSD accessed via GDS.  
  2. Remote Storage: DPU-emulated1 NVMe-oF drive (SSD over InfiniBand) via GDS.

Benchmark parameters

  • Block Sizes: 64KiB, 128KiB, 512KiB, 1024KiB, 4096KiB 
  • SNAP Queue: 1, 7, 15, 23, 31
  • Workloads: Sequential Read
  • Queue Depth (QD): 24/32  
  • Metrics: Throughput (GiBps)
  • Runtime: 45 seconds

Results

             PCIe 4.0 – Solidigm D7-P5520 7.68TB          PCIe 5.0 – Solidigm D7-PS1010 7.68TB
             FW: 9CV10330                                  FW: G77YG100
IO Size      Direct Setup (GiBps)   Remote Setup (GiBps)   Direct Setup (GiBps)   Remote Setup (GiBps)
64KiB        4.42                   4.14                   12.38                  10.42
128KiB       5.27                   5.07                   13.20                  13.16
512KiB       6.50                   6.45                   13.41                  13.50
1024KiB      6.58                   6.70                   13.48                  13.85
4096KiB      6.46                   6.50                   13.48                  13.85

Table 3. PCIe 4.0 vs. PCIe 5.0 results

Impact of SNAP queue on Solidigm D7-PS1010 E1.S 7.68TB 

             SNAP Queue 1   SNAP Queue 7   SNAP Queue 15   SNAP Queue 23   SNAP Queue 31
IO Size      (GiBps)        (GiBps)        (GiBps)         (GiBps)         (GiBps)
64KiB        6.77           8.06           9.48            9.78            10.68
128KiB       9.18           11.1           12.68           12.73           12.93
512KiB       9.44           11.15          12.53           13.06           13.09
1024KiB      9.56           12.25          12.59           13.15           13.34
4096KiB      10.57          12.56          13.48           13.67           13.73

Table 4. Impact of SNAP queue

Figure 9. Solidigm D7-P5520 throughput for direct setup vs. remote setup
Figure 10. Solidigm D7-PS1010 throughput for direct setup vs. remote setup
Figure 11. Impact of SNAP queue count on throughput

Key takeaway and analysis

1. Throughput parity

The NVIDIA BlueField DPU's protocol offloading and RDMA minimize fabric overhead, enabling near-local throughput. As the graphs show for both PCIe Gen4 and PCIe Gen5, remote-storage throughput is comparable to the local (direct-attached) setup.

2. Block size impact

As block size increases, remote-setup throughput slightly exceeds the local setup, while at smaller block sizes remote throughput is slightly lower, because per-I/O fabric overhead weighs more heavily on small transfers.

3. SNAP queue impact

The SNAP queue count is another important factor when enabling a remote storage setup. By increasing the number of SNAP queues to 32, we see higher throughput because more I/O requests are handled simultaneously, reducing potential bottlenecks, as shown in the graph for the various queue counts.

Conclusion and future directions

This white paper demonstrates that NVIDIA GPUDirect Storage, combined with Solidigm PCIe Gen5 SSDs and DPU-driven NVMe-oF emulation,1 enables remote storage performance at parity with local NVMe drives. By eliminating CPU bottlenecks and leveraging RDMA over high-speed InfiniBand fabrics, the architecture unlocks scalable, cost-efficient AI infrastructure without compromising acceleration, demonstrating that centralized storage pools can replace local drives while maintaining GPU workload efficiency.

The implications for AI infrastructure

Organizations can reduce hardware sprawl and operational costs by decoupling storage from compute nodes, while energy savings from DPU offloading and streamlined data paths support sustainable scaling. This approach is particularly impactful for distributed training workflows and edge inferencing deployments, where low-latency access to shared datasets is critical.  

Looking ahead, advancements in 800G networking, DPU-accelerated computational storage, and deeper integration with Kubernetes and ML frameworks can further solidify this architecture as the foundation for next-gen AI data centers.  

Recommendations

Organizations can consider adopting GDS and DPUs to future-proof their AI infrastructure, pairing Solidigm PCIe Gen4 and/or Gen5 SSDs for bulk data workloads with RDMA-enabled fabrics to minimize latency. This unified architecture can help enterprises scale GPU resources while maintaining performance and cost efficiency.

 


About the Authors

Ashwin Pai is a System Validation Engineer at Solidigm, with nearly a decade of experience in software, hardware, and systems engineering. He focuses on validating next-generation SSD technologies across diverse platforms, including those optimized for AI and data-intensive workloads. Ashwin collaborates across cross-functional teams utilizing advanced AI methodologies and breakthrough innovations to enhance the capabilities of Solidigm SSDs in AI-driven environments. He holds a Bachelor of Engineering in Electronics from VES Institute of Technology and an M.S. in Computer Engineering from North Carolina State University.

Akhil Srinivas is an Electrical & Systems Engineer at Solidigm. He collaborates with industry-leading ecosystem vendors to validate Solidigm SSDs for cutting-edge storage solutions. He leverages emerging AI technologies and pathfinding innovations to position Solidigm SSDs as critical components in next-generation platforms, strengthening partnerships in the AI space. Beyond the enterprise, he indulges in culinary adventures, exploring popular food trucks and restaurants across the country. Akhil holds a Bachelor of Telecommunications Engineering from R.V. College of Engineering and an M.S. in Electrical and Computer Engineering from University of California, Davis.

Appendix

1. We have referred to the following links for GDS and DOCA setup and installation.

2. Workloads executed

  • gdsio -T 45  -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 64k 
  • gdsio -T 45  -s 512M -I 0 -x 0 -D /mnt -w 32 -d 0 -i 128k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 512k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 1024k 
  • gdsio -T 45  -s 2048M -I 0 -x 0 -D /mnt -w 24 -d 0 -i 4096k

-T specifies the duration of the test in seconds

-s sets the size of the dataset

-I selects the I/O type, where 0 is a sequential read

-x selects the transfer type, where 0 is the GDS (storage → GPU) path

-D sets the directory path where the test files will be stored

-w specifies the number of worker threads to be used during the test

-d indicates the GPU device ID to be used

-i sets the I/O size

3. We captured the server power consumption using the server management console.
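On platforms that expose DCMI over IPMI, one illustrative alternative is to sample power with ipmitool; this is an option shown for reference rather than the method used for the figures above.

  # Read the current chassis power draw from the BMC (requires DCMI support and BMC access)
  ipmitool dcmi power reading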

Notes

1. In NVIDIA DOCA SNAP, emulation refers to creating a software-based NVMe device that behaves like real hardware to the host system.

Disclaimers

©2025, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.

Other names and brands may be claimed as the property of others. 

Solidigm may make changes to specifications and product descriptions at any time, without notice. 

Tests document the performance of components on a particular test, in specific systems. 

Differences in hardware, software, or configuration will affect actual performance.

Consult other sources of information to evaluate performance as you consider your purchase. 

These results are preliminary and provided for information purposes only. These values and claims are neither final nor official. 

Drives are considered engineering samples. Refer to roadmap for production guidance.