Performance of Solidigm™ SSDs with BaM and GIDS

Introduction to Big Accelerator Memory (BaM) and GPU Initiated Direct Storage (GIDS)

AI depiction of GPU utilization and reduced training times with BaM and GIDS

The exponential growth and complexity of Graph Neural Networks (GNNs) have intensified demands on AI training infrastructure. Heterogeneous datasets like IGBH (with billions of edges and high-dimensional features) require massive amounts of data movement between storage and GPU memory. Traditional storage infrastructure, where the CPU handles all data transfer, creates data bottlenecks and stalls pipelines, leading to underutilized GPUs and extended training times. This bottleneck is addressed through two open-source research projects:

1. BaM (Big Accelerator Memory) 

BaM rethinks data placement by enabling the GPU to manage storage interactions with NVMe SSDs instead of the traditional CPU model. BaM bypasses legacy drivers by using a custom software stack that utilizes parallelism in GPUs for bulk transfers from storage devices.

2. GIDS (GPU Initiated Direct Storage) 

Complementing BaM is the GIDS dataloader, which is built on top of the BaM software architecture. GIDS enables the GPU to directly initiate and manage storage I/O, specifically addressing the requirements of GNN training and eliminating CPU intervention entirely. Together, BaM and GIDS minimize latency, maximize throughput, and saturate GPUs with data for optimal utilization.

Solidigm evaluation of BaM and GIDS

The Solidigm™ D7-PS1010 SSD used in our testing enables BaM and GIDS to function as tangible accelerators for real-world GNN training workloads, such as the IGB heterogeneous dataset. This white paper presents the findings from BaM and GIDS technology evaluations conducted using Solidigm D7-PS1010 SSDs.

Architecture for training

Architecture for GNN training using BaM and GIDS

GIDS is designed to accelerate GNN training by shifting data loading from the CPU to the GPU. Unlike the traditional CPU-bound mmap (memory-mapped) approach in the Deep Graph Library (DGL), GIDS delivers faster end-to-end data loading by harnessing direct GPU data transfers from Solidigm D7-PS1010 SSDs through BaM, enabling greater efficiency for GNN training.

Figure 1. GIDS dataloader workflow for GNN training

GIDS optimizes storage access for graph workloads via four key innovations:

  1. Dynamic storage access accumulator: Asynchronously aggregates storage requests from GPU threads into contiguous access blocks to achieve peak SSD bandwidth.
  2. Window buffering: Optimizes cache usage by predicting future data access patterns, eliminating PCIe bus stalls during node/edge feature loading.
  3. Constant CPU buffer: A fixed-size pinned memory region that stores frequently accessed data.
  4. BaM software stack: Uses a custom storage driver to leverage the parallelism of the GPU.
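The accumulator idea in item 1 can be sketched in a few lines of Python. This is an illustrative model, not GIDS code: the block size, the one-feature-per-block layout, and the function name are assumptions, and the real accumulator runs asynchronously across thousands of GPU threads.

```python
# Illustrative sketch of the dynamic storage access accumulator idea:
# merge scattered per-thread feature requests into contiguous block
# ranges so the SSD services a few large reads instead of many small
# ones. Block size and layout (one feature vector per 4 KiB block) are
# assumptions for this example.
BLOCK_SIZE = 4096
FEATURE_BYTES = 4096  # one feature vector per block (assumption)

def coalesce_requests(node_ids):
    """Map node feature requests to storage blocks, then merge adjacent
    blocks into contiguous (start_block, block_count) ranges."""
    blocks = sorted({(nid * FEATURE_BYTES) // BLOCK_SIZE for nid in node_ids})
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][0] + ranges[-1][1]:
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)   # extend the current run
        else:
            ranges.append((b, 1))             # start a new run
    return ranges

# Six scattered requests collapse into two contiguous reads:
print(coalesce_requests([3, 0, 1, 2, 101, 100]))  # → [(0, 4), (100, 2)]
```

Issuing the two merged ranges as large sequential reads is what lets the accumulator approach peak SSD bandwidth despite fine-grained, per-node access patterns.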

Architecture for GNN training using DGL mmap

DGL mmap (memory-mapped I/O in the Deep Graph Library) is a traditional, CPU-driven approach to graph data loading. Compared with GIDS, it carries performance tradeoffs, especially in large-scale GPU workloads.

Figure 2. Deep Graph Library loader

Key characteristics of DGL mmap: 

  1. Memory-mapped file access: Uses OS-level mmap to map graph data (nodes, edges, features) into virtual memory.
  2. CPU-orchestrated data movement: The CPU handles all data access and transfers to the GPU via PCIe.
  3. No GPU-aware caching: Lacks predictive caching or buffering mechanisms.
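The CPU-driven pattern above can be sketched with Python's standard-library mmap. This is a simplification of what DGL does internally, not DGL code; the file name, node count, and feature width are made up for illustration.

```python
# Minimal sketch of the CPU-driven mmap pattern behind the DGL loader
# (simplified; DGL wraps this in its own dataset classes). File name,
# feature width, and node count are hypothetical.
import mmap
import struct

N_NODES, FEAT_DIM = 1000, 8
REC_BYTES = FEAT_DIM * 4  # float32 features

# Write a fake feature file: node i holds [i, i, ..., i].
with open("features.bin", "wb") as f:
    for i in range(N_NODES):
        f.write(struct.pack(f"{FEAT_DIM}f", *([float(i)] * FEAT_DIM)))

# OS-level mmap: the file is mapped into virtual memory, and the CPU
# faults pages in on access -- no GPU-aware caching or prefetching.
with open("features.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    batch_ids = [3, 42, 7, 999]
    # The CPU gathers each row, then would push the batch to the GPU
    # over PCIe (e.g., via a framework tensor copy).
    batch = [struct.unpack(f"{FEAT_DIM}f",
                           mm[i * REC_BYTES:(i + 1) * REC_BYTES])
             for i in batch_ids]
    mm.close()

print([row[0] for row in batch])  # → [3.0, 42.0, 7.0, 999.0]
```

Every gather and transfer in this path consumes CPU cycles and serializes behind the host, which is exactly the bottleneck GIDS removes by letting the GPU issue the reads itself.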

Testing methodology

We compare the end-to-end time (100 steps) for data loading efficiency using GIDS' GPU-centric architecture with Solidigm D7-PS1010 SSDs versus DGL mmap. We tested both the small and full heterogeneous datasets from the Illinois Graph Benchmark (IGBH) for training the GNN.
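The measurement itself is a simple wall-clock harness, sketched below. The step body here is a stand-in; in the real runs each step samples neighbors, loads features (via GIDS or DGL mmap), and trains the GNN.

```python
# Sketch of the end-to-end timing harness: wall-clock seconds over 100
# training steps. The dummy step body is a placeholder for the real
# sample-load-train loop.
import time

N_STEPS = 100

def run_steps(step_fn, n_steps=N_STEPS):
    """Return wall-clock seconds for n_steps calls to step_fn."""
    start = time.perf_counter()
    for step in range(n_steps):
        step_fn(step)
    return time.perf_counter() - start

elapsed = run_steps(lambda step: None)  # dummy step body
print(f"{N_STEPS} steps took {elapsed:.6f} s")
```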

1. System configuration

Software:

  1. OS: Ubuntu 20.04.5 LTS
  2. Kernel: 5.8.0
  3. CUDA 12.9
  4. BaM 
  5. GIDS dataloader
  6. Python Deep Graph Library (DGL)

Hardware:

  1. Server: Supermicro SYS-421GE-TNRT  
  2. CPU: Intel Xeon Silver 4516Y
  3. GPU: Nvidia L40S
  4. SSD: Solidigm™ D7-PS1010 (E1.S, 3.84TB, PCIe 5.0 air-cooled)

2. Benchmark parameters

  • Dataset: IGB Heterogeneous Small (10M Edges), IGB Heterogeneous Full (5.8B Edges)
  • Baseline: DGL mmap (CPU-centric dataloader)
  • Test tool: GIDS dataloader
  • Metrics: End-to-end runtime (seconds) for 100 steps during training
  • Each step samples neighbors for a batch of 512 seed nodes across 2 layers
  • GNN model: Relational Graph Attention Network (RGAT)
  • Window buffering: 8GB
  • Constant CPU buffer: 20% of dataset size
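A back-of-the-envelope check of the cache parameters above: the 8GB window buffer and the 20% CPU-buffer ratio come from the configuration, while the feature-store size and feature width below are hypothetical placeholders, not measured values.

```python
# Sizing the caches from the benchmark parameters. The 8 GB window
# buffer and 20% CPU-buffer ratio are from the configuration above;
# the feature-store size and feature width are hypothetical.
GIB = 1024 ** 3

feature_store_bytes = 300 * GIB                      # hypothetical feature store size
window_buffer_bytes = 8 * GIB                        # GPU window buffer (from config)
cpu_buffer_bytes = int(0.20 * feature_store_bytes)   # constant CPU buffer = 20%

feat_bytes = 1024 * 4                                # e.g. 1024-dim float32 (assumption)
vectors_in_window = window_buffer_bytes // feat_bytes

print(f"CPU buffer: {cpu_buffer_bytes / GIB:.0f} GiB")               # → 60 GiB
print(f"Window buffer holds {vectors_in_window:,} feature vectors")  # → 2,097,152
```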

The benchmark accuracy target was set to 70% during training to align with the MLPerf target for IGBH. Note: MLPerf is an industry-standard benchmark suite for evaluating machine learning performance across diverse workloads, and this threshold indicates that the model has reached a meaningful level of convergence and quality.

We used publicly accessible code repositories to enable and run GIDS and BaM, with datasets from the Illinois Graph Benchmark suite. See appendix for links.

3. Results

Accuracy of model after training with IGBH small dataset (10M Edges)

The training accuracy for both data loaders meets the baseline threshold of 70%, validating the trained model.

Figure 3. Accuracy of model after training with IGBH small dataset (10M Edges)

E2E time for IGBH small dataset (10M Edges) using a single Solidigm D7-PS1010 drive for 100 steps

When using the Solidigm D7-PS1010 PCIe 5.0 SSD, GIDS cuts load time by almost 2x for small graphs by minimizing overhead and utilizing parallelism.

Figure 4. E2E time for IGBH small dataset (10M Edges)

Accuracy of model after training with IGBH full dataset (5.8B Edges)

The training accuracy for both data loaders meets the baseline threshold of 70%, validating the trained model.

Figure 5. Accuracy of model after training with IGBH full dataset (5.8B Edges)

E2E time for IGBH full dataset (5.8B Edges) using a single Solidigm D7-PS1010 drive for 100 steps

GIDS scales with increasing edge counts, running almost 9x faster than DGL mmap on the full dataset.

Figure 6. E2E time for IGBH full dataset (5.8B Edges)

GPU cache metrics achieved when using BaM and GIDS with IGBH dataset

Dataset   Cache Hit Rate   Cache Miss Rate
Small     77.49%           22.51%
Full      10.80%           89.20%

Table 1. GPU cache metrics achieved when using BaM and GIDS with IGBH dataset

Table 1 compares GPU cache performance metrics when training a GNN using the GIDS data loader on two dataset scales (Small vs Full) of the IGBH heterogeneous graph. The metrics include:

  • Cache Hit Rate: Percentage of feature fetch requests served from the software cache (a multi-layered cache comprising the GPU window buffer and constant CPU buffer) rather than from the SSD under BaM.
  • Cache Miss Rate: Percentage of requests that require fetching from the SSD under BaM because the data was not present in the software cache.

Cache hit and miss rates differ drastically between the small and full datasets (roughly 77% hits for the small run versus 11% for the full run). However, this does not translate into a correspondingly large end-to-end time gap, because GPU compute dominates the training loop and overlaps with I/O. Since data is accessed through BaM with coalescing and prefetching, SSD reads are serviced efficiently during computation, reducing the impact of cache misses on overall runtime.
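A simple expected-latency model makes this concrete. The hit rates are taken from Table 1, but the per-access latencies below are hypothetical round numbers chosen for illustration, not measurements from our testing.

```python
# Illustrative model of why the full dataset's low hit rate need not
# blow up end-to-end time. Per-access latencies are hypothetical;
# hit rates come from Table 1.
def avg_fetch_us(hit_rate, cache_us=2.0, ssd_us=100.0):
    """Expected per-request latency for a given cache hit rate."""
    return hit_rate * cache_us + (1.0 - hit_rate) * ssd_us

small = avg_fetch_us(0.7749)  # IGBH-small hit rate
full = avg_fetch_us(0.1080)   # IGBH-full hit rate
print(f"small: {small:.1f} us/request, full: {full:.1f} us/request")

# The resulting ~3.7x gap in average fetch cost is largely hidden when
# fetches overlap with GPU compute, which dominates each training step.
```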

The DGL mmap method is not included in the table because it has no mechanisms for GPU software caching or SSD-aware prefetching, and therefore no software cache accesses to report.

Conclusion and Future Directions

The Solidigm D7-PS1010 SSD unlocks performance gains when paired with BaM and GIDS technologies for GNN training. 

  • Utilizing the fast sequential and random read capabilities of Solidigm D7-PS1010 SSDs, integration with BaM and GIDS accelerates the ingestion and preparation of large-scale graph datasets.
  • GIDS significantly reduces data load times, achieving up to 2x faster loading for the IGBH-small dataset (10M edges) compared to the traditional CPU-bound DGL mmap approach.

For the largest evaluated dataset, IGBH-full (5.8 billion edges), GIDS maintains high-speed data loading that is 9 times faster than legacy methods. Our work on BaM and GIDS lays a solid foundation for the readiness of Solidigm D7-PS1010 SSDs in future storage configurations.


About the Authors

Ashwin Pai is a System Validation Engineer at Solidigm, with nearly a decade of experience in software, hardware, and systems engineering. He focuses on validating next-generation SSD technologies across diverse platforms, including those optimized for AI and data-intensive workloads. Ashwin collaborates across cross-functional teams utilizing advanced AI methodologies and breakthrough innovations to enhance the capabilities of Solidigm SSDs in AI-driven environments. He holds a Bachelor of Engineering in Electronics from VES Institute of Technology and an M.S. in Computer Engineering from North Carolina State University.

Akhil Srinivas is an Electrical & Systems Engineer at Solidigm. He collaborates with industry-leading ecosystem vendors to validate Solidigm SSDs for cutting-edge storage solutions. He leverages emerging AI technologies and pathfinding innovations to position Solidigm SSDs as critical components in next-generation platforms, strengthening partnerships in the AI space. Beyond the enterprise, he indulges in culinary adventures, exploring popular food trucks and restaurants across the country. Akhil holds a Bachelor of Telecommunications Engineering from R.V. College of Engineering and an M.S. in Electrical and Computer Engineering from University of California, Davis.

About the Solidigm D7-PS1010 SSD E1.S form factor

The Solidigm D7-PS1010 E1.S is the leading SSD for AI workloads. The E1.S D7-PS1010 is offered in the following configurations targeted for specific AI server environments:

Form Factor   3.84TB                        7.68TB
9.5mm         Liquid cooled or air cooled   Liquid cooled or air cooled
15mm          Air cooled only               Air cooled only

For more details about the Solidigm 3.84TB E1.S D7-PS1010, please visit:

https://www.solidigm.com/E1.S-D7-PS1010-3.84TB

Appendix

We referred to the following links for BaM and GIDS tools, scripts, and datasets.

  1. https://github.com/jeongminpark417/GIDS
  2. https://github.com/ZaidQureshi/bam
  3. https://github.com/IllinoisGraphBenchmark/IGB-Datasets

Disclaimers

©2026, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp. (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.

Other names and brands may be claimed as the property of others. 

Solidigm may make changes to specifications and product descriptions at any time, without notice. 

Tests document the performance of components on a particular test, in specific systems. 

Differences in hardware, software, or configuration will affect actual performance.

 Consult other sources of information to evaluate performance as you consider your purchase. 

These results are preliminary and provided for information purposes only. These values and claims are neither final nor official. 

Drives are considered engineering samples. Refer to roadmap for production guidance.