Modern AI workloads demand unprecedented data throughput and low-latency access to massive datasets. Traditional storage architectures, reliant on CPU-facilitated data movement between NVMe SSDs and GPUs, struggle to keep pace with GPU compute capabilities. Data center SSDs, like the Solidigm™ D7-PS1010, deliver up to 14,500 MB/s sequential read speeds, but unlocking their full potential requires rethinking how GPUs interact with storage both locally and across distributed remote systems.
NVIDIA GPUDirect Storage (GDS) eliminates the CPU bottleneck by enabling direct memory access (DMA) between GPUs and NVMe SSDs. As part of the NVIDIA Magnum IO SDK, GDS integrates with frameworks such as CUDA to bypass CPU/RAM data staging, reducing latency and freeing CPU resources for critical management tasks.
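As an illustration of how an application drives this direct path, the sketch below uses the cuFile API, the interface GDS exposes alongside CUDA, to read a file straight into GPU memory. This is a minimal sketch only; the file path, transfer size, and minimal error handling are placeholder assumptions, not the configuration used in our testing.

```cpp
// Minimal GDS read sketch using the cuFile API (illustrative only).
// Typical build: nvcc gds_read.cu -o gds_read -lcufile
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
    const char*  path = "/mnt/nvme/dataset.bin";  // placeholder path
    const size_t size = 16UL << 20;               // 16 MiB transfer, placeholder

    cuFileDriverOpen();                           // initialize the GDS driver

    // O_DIRECT keeps the page cache out of the data path.
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);        // register the file with cuFile

    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);                    // destination buffer in GPU memory
    cuFileBufRegister(devPtr, size, 0);           // register the buffer for DMA

    // DMA directly from the SSD into GPU memory; no CPU bounce buffer.
    ssize_t n = cuFileRead(handle, devPtr, size, /*file_offset=*/0, /*devPtr_offset=*/0);
    printf("cuFileRead returned %zd bytes\n", n);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```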
While GDS optimizes local storage access, modern AI infrastructure demands scalable solutions that decouple storage from individual GPU nodes. NVIDIA Data Processing Units (DPUs) bridge this gap by offloading storage and networking tasks, enabling remote NVMe over Fabrics (NVMe-oF) emulation¹ using the DPU’s SNAP framework. Solidigm PCIe Gen5 SSDs can be virtualized as remote drives over high-speed fabrics, allowing GPUs to access distributed storage pools. This architecture combines GDS’s direct data paths with DPU-driven fabric scalability, delivering a unified solution for AI workloads.
Hardware
Software
Two data paths are compared:
1. GDS path: Direct DMA transfers between the GPU and the SSD.
2. Traditional path: Data moves from SSD → CPU/RAM → GPU (a minimal sketch of this path follows for contrast).
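The sketch below illustrates the traditional path for contrast with the GDS sketch above: data is first read into a host (CPU/RAM) buffer and then copied to GPU memory with cudaMemcpy. The file path and transfer size are placeholders, not the tested configuration.

```cpp
// Traditional path sketch (illustrative only): SSD -> CPU/RAM -> GPU.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const char*  path = "/mnt/nvme/dataset.bin";  // placeholder path
    const size_t size = 16UL << 20;               // 16 MiB transfer, placeholder

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Stage 1: SSD -> pinned host buffer (CPU/RAM), consuming CPU cycles.
    void* hostBuf = nullptr;
    cudaHostAlloc(&hostBuf, size, cudaHostAllocDefault);
    ssize_t n = pread(fd, hostBuf, size, 0);
    if (n < 0) { perror("pread"); return 1; }

    // Stage 2: host buffer -> GPU memory over PCIe.
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cudaMemcpy(devPtr, hostBuf, (size_t)n, cudaMemcpyHostToDevice);
    printf("staged %zd bytes through host memory\n", n);

    cudaFree(devPtr);
    cudaFreeHost(hostBuf);
    close(fd);
    return 0;
}
```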
| IO Size | GDS Path Throughput (GiBps) | GDS Path CPU_USR (%) | Traditional (CPU-GPU) Path Throughput (GiBps) | Traditional (CPU-GPU) Path CPU_USR (%) |
|---|---|---|---|---|
| 64KiB | 4.35 | 0.14 | 4.30 | 0.92 |
| 128KiB | 5.21 | 0.08 | 5.18 | 0.56 |
| 512KiB | 6.50 | 0.03 | 6.51 | 0.20 |
| 1024KiB | 6.59 | 0.02 | 6.64 | 0.12 |
| 4096KiB | 6.62 | 0.01 | 6.63 | 0.06 |
Table 1. Solidigm D7-P5520 FW: 9CV10330 (U.2, 7.68TB, PCIe 4.0)
| IO Size | GDS Path Throughput (GiBps) | GDS Path CPU_USR (%) | Traditional (CPU-GPU) Path Throughput (GiBps) | Traditional (CPU-GPU) Path CPU_USR (%) |
|---|---|---|---|---|
| 64KiB | 12.38 | 0.51 | 12.70 | 3.15 |
| 128KiB | 13.20 | 0.27 | 13.48 | 1.64 |
| 512KiB | 13.41 | 0.04 | 13.48 | 0.46 |
| 1024KiB | 13.48 | 0.02 | 13.48 | 0.29 |
| 4096KiB | 13.48 | 0.01 | 13.48 | 0.14 |
Table 2. Solidigm D7-PS1010 FW: G77YG100 (E1.S, 7.68TB, PCIe 5.0)
In this section, we demonstrate Solidigm SSD performance in the NVIDIA Magnum IO architecture, which includes NVIDIA Magnum IO GPUDirect Storage and NVIDIA NVMe SNAP.¹
A DPU is a specialized processor designed to offload infrastructure tasks (networking, storage, security) from CPUs. NVIDIA BlueField DPUs combine multi-core Arm CPUs, high-speed networking, and hardware accelerators to optimize data center efficiency.
SNAP is a DPU-accelerated framework that virtualizes remote SSDs as local NVMe drives. Running in containers on NVIDIA DPUs, SNAP translates local NVMe commands into NVMe-oF protocol packets, enabling direct RDMA transfers between remote storage and GPU memory.
NVMe-oF extends the NVMe protocol to access remote storage devices over networks such as InfiniBand. This enables shared storage pools, scalable resource allocation, and allows GPUs and servers to treat high-performance SSDs as if they were locally attached.
Storage server
Compute server
Two setups are compared: a direct setup (locally attached SSD) and a remote setup (SSD accessed over NVMe-oF).
| IO Size | PCIe 4.0 (D7-P5520) Direct Setup (GiBps) | PCIe 4.0 (D7-P5520) Remote Setup (GiBps) | PCIe 5.0 (D7-PS1010) Direct Setup (GiBps) | PCIe 5.0 (D7-PS1010) Remote Setup (GiBps) |
|---|---|---|---|---|
| 64KiB | 4.42 | 4.14 | 12.38 | 10.42 |
| 128KiB | 5.27 | 5.07 | 13.20 | 13.16 |
| 512KiB | 6.50 | 6.45 | 13.41 | 13.50 |
| 1024KiB | 6.58 | 6.70 | 13.48 | 13.85 |
| 4096KiB | 6.46 | 6.50 | 13.48 | 13.85 |
Table 3. PCIe 4.0 vs. PCIe 5.0 results: Solidigm D7-P5520 7.68TB (FW: 9CV10330) and Solidigm D7-PS1010 7.68TB (FW: G77YG100), direct vs. remote setup
Impact of SNAP queue count on Solidigm D7-PS1010 E1.S 7.68TB
| IO Size | 1 Queue (GiBps) | 7 Queues (GiBps) | 15 Queues (GiBps) | 23 Queues (GiBps) | 31 Queues (GiBps) |
|---|---|---|---|---|---|
| 64KiB | 6.77 | 8.06 | 9.48 | 9.78 | 10.68 |
| 128KiB | 9.18 | 11.1 | 12.68 | 12.73 | 12.93 |
| 512KiB | 9.44 | 11.15 | 12.53 | 13.06 | 13.09 |
| 1024KiB | 9.56 | 12.25 | 12.59 | 13.15 | 13.34 |
| 4096KiB | 10.57 | 12.56 | 13.48 | 13.67 | 13.73 |
Table 4. Impact of SNAP queue count
The NVIDIA BlueField DPU’s protocol offloading and RDMA minimize fabric overhead, enabling near-local throughput. As the results for both PCIe Gen4 and PCIe Gen5 show (Table 3), throughput for the remote storage setup is comparable to that of the local storage setup.
We also observe that at larger block sizes the remote setup’s throughput slightly exceeds that of the local setup, while at smaller block sizes it is slightly lower, because per-I/O overhead over the fabric is proportionally higher for small transfers.
SNAP queue count is another important factor to consider when enabling a remote storage setup. Increasing the number of SNAP queues yields higher throughput, because more I/O requests are handled simultaneously and potential bottlenecks are reduced, as shown in Table 4 for various queue counts.
This white paper demonstrates that NVIDIA GPUDirect Storage, combined with Solidigm PCIe Gen5 SSDs and DPU-driven NVMe-oF emulation,¹ enables remote storage performance parity with local NVMe drives. By eliminating CPU bottlenecks and leveraging RDMA over high-speed InfiniBand fabrics, the architecture unlocks scalable, cost-efficient AI infrastructure without compromising acceleration, demonstrating that centralized storage pools can replace local drives while maintaining GPU workload efficiency.
Organizations can reduce hardware sprawl and operational costs by decoupling storage from compute nodes, while energy savings from DPU offloading and streamlined data paths support sustainable scaling. This approach is particularly impactful for distributed training workflows and edge inferencing deployments, where low-latency access to shared datasets is critical.
Looking ahead, advancements in 800G networking, DPU-accelerated computational storage, and deeper integration with Kubernetes and ML frameworks can further solidify this architecture as the foundation for next-gen AI data centers.
Organizations can consider adopting GDS and DPUs to future-proof their AI infrastructure, pairing Solidigm PCIe Gen4 and/or Gen5 SSDs for bulk data workloads with RDMA-enabled fabrics to minimize latency. This unified architecture can help enterprises scale GPU resources while maintaining performance and cost efficiency.
Ashwin Pai is a System Validation Engineer at Solidigm, with nearly a decade of experience in software, hardware, and systems engineering. He focuses on validating next-generation SSD technologies across diverse platforms, including those optimized for AI and data-intensive workloads. Ashwin collaborates across cross-functional teams utilizing advanced AI methodologies and breakthrough innovations to enhance the capabilities of Solidigm SSDs in AI-driven environments. He holds a Bachelor of Engineering in Electronics from VES Institute of Technology and an M.S. in Computer Engineering from North Carolina State University.
Akhil Srinivas is an Electrical & Systems Engineer at Solidigm. He collaborates with industry-leading ecosystem vendors to validate Solidigm SSDs for cutting-edge storage solutions. He leverages emerging AI technologies and pathfinding innovations to position Solidigm SSDs as critical components in next-generation platforms, strengthening partnerships in the AI space. Beyond the enterprise, he indulges in culinary adventures, exploring popular food trucks and restaurants across the country. Akhil holds a Bachelor of Telecommunications Engineering from R.V. College of Engineering and an M.S. in Electrical and Computer Engineering from University of California, Davis.
1. We have referred to the following links for GDS and DOCA setup and installation.
2. Workloads executed (a hypothetical invocation is sketched after the parameter list):
- <T> specifies the duration of the test in seconds
- <s> sets the size of the dataset
- <I> indicates the iteration count, where 0 typically means continuous or unlimited iterations until the test duration is reached
- <x> defines the transfer type, with 0 usually representing a read operation
- <D> sets the directory path where the test files will be stored
- <w> specifies the number of worker threads to be used during the test
- <d> indicates the GPU device ID to be used
- <i> sets the I/O size
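The exact commands are not reproduced here. As a hypothetical illustration only, and assuming the load generator is NVIDIA's gdsio utility that ships with GPUDirect Storage (the parameter letters above match its flags), a run could be launched as sketched below; all values are placeholders, not the settings used in our tests.

```cpp
// Hypothetical workload launch (illustrative only). Assumes NVIDIA's gdsio utility;
// the tool and all values below are placeholders, not the exact commands used in
// testing. Flag meanings follow the parameter list above.
#include <cstdlib>

int main() {
    const char* cmd =
        "gdsio"
        " -T 120"            // test duration in seconds
        " -s 10G"            // dataset size
        " -I 0"              // iteration setting, 0 as described above
        " -x 0"              // transfer type, 0 as described above
        " -D /mnt/gds_test"  // directory for test files (placeholder path)
        " -w 16"             // worker threads
        " -d 0"              // GPU device ID
        " -i 1024K";         // I/O size
    return std::system(cmd);
}
```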
3. We captured server power consumption using the Server Management Console.
1. In NVIDIA DOCA SNAP, emulation refers to creating a software-based NVMe device that behaves like real hardware to the host system.
©2025, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.
Other names and brands may be claimed as the property of others.
Solidigm may make changes to specifications and product descriptions at any time, without notice.
Tests document the performance of components on a particular test, in specific systems.
Differences in hardware, software, or configuration will affect actual performance.
Consult other sources of information to evaluate performance as you consider your purchase.
These results are preliminary and provided for information purposes only. These values and claims are neither final nor official.
Drives are considered engineering samples. Refer to roadmap for production guidance.