Performance of Solidigm™ SSDs with BaM and GIDS

Introduction to Big Accelerator Memory (BaM) and GPU Initiated Direct Storage (GIDS)

AI depiction of GPU utilization and reduced training times with BaM and GIDS

The exponential growth and complexity of Graph Neural Networks (GNNs) have intensified demands on AI training infrastructure. Heterogeneous datasets like IGBH (with billions of edges and high-dimensional features) require massive amounts of data movement between storage and GPU memory. Traditional storage infrastructure, where the CPU handles all data transfer, creates data bottlenecks and stalls pipelines, leading to underutilized GPUs and extended training times. This bottleneck is addressed through two open-source research projects:

1. BaM (Big Accelerator Memory) 

BaM rethinks data placement by enabling the GPU to manage storage interactions with NVMe SSDs instead of the traditional CPU model. BaM bypasses legacy drivers by using a custom software stack that utilizes parallelism in GPUs for bulk transfers from storage devices.

2. GIDS (GPU Initiated Direct Storage) 

Complementing BaM is the GIDS dataloader, which is built on top of the BaM software architecture. GIDS enables the GPU to directly initiate and manage storage I/O, specifically addressing the requirements of GNN training and eliminating CPU intervention entirely. Together, BaM and GIDS minimize latency, maximize throughput, and saturate GPUs with data for optimal utilization.

Solidigm evaluation of BaM and GIDS

The Solidigm™ D7-PS1010 SSD used in our testing enables BaM and GIDS to function as tangible accelerators for real-world GNN training workloads, such as the IGB heterogeneous dataset. This white paper presents the findings from BaM and GIDS technology evaluations conducted using Solidigm D7-PS1010 SSDs.

Architecture for training

Architecture for GNN training using BaM and GIDS

GIDS is designed to accelerate GNN training by shifting data loading from the CPU to the GPU. Unlike the traditional CPU-bound mmap (memory-mapped) approach in the Deep Graph Library (DGL), GIDS delivers faster end-to-end data loading by harnessing direct GPU data transfers from Solidigm D7-PS1010 SSDs through BaM, enabling greater efficiency for GNN training.

Figure 1. GIDS dataloader workflow for GNN training

GIDS optimizes storage access for graph workloads via four key innovations:

  1. Dynamic storage access accumulator: Asynchronously aggregates storage requests from GPU threads into contiguous access blocks to achieve peak SSD bandwidth.
  2. Window buffering: Optimizes cache usage by predicting future data access patterns, eliminating PCIe bus stalls during node/edge feature loading.
  3. Constant CPU buffer: A fixed-size pinned memory region that stores frequently accessed data.
  4. BaM software stack: Uses a custom storage driver to leverage the parallelism of the GPU.
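The accumulator idea in item 1 can be sketched in a few lines of Python. This is an illustrative model, not GIDS code: the block size, the one-feature-per-block layout, and the function name are assumptions, and the real accumulator runs asynchronously across thousands of GPU threads.

```python
# Illustrative sketch of the dynamic storage access accumulator idea:
# merge scattered per-thread feature requests into contiguous block
# ranges so the SSD services a few large reads instead of many small
# ones. Block size and layout (one feature vector per 4 KiB block) are
# assumptions for this example.
BLOCK_SIZE = 4096
FEATURE_BYTES = 4096  # one feature vector per block (assumption)

def coalesce_requests(node_ids):
    """Map node feature requests to storage blocks, then merge adjacent
    blocks into contiguous (start_block, block_count) ranges."""
    blocks = sorted({(nid * FEATURE_BYTES) // BLOCK_SIZE for nid in node_ids})
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][0] + ranges[-1][1]:
            start, count = ranges[-1]
            ranges[-1] = (start, count + 1)   # extend the current run
        else:
            ranges.append((b, 1))             # start a new run
    return ranges

# Six scattered requests collapse into two contiguous reads:
print(coalesce_requests([3, 0, 1, 2, 101, 100]))  # → [(0, 4), (100, 2)]
```

Issuing the two merged ranges as large sequential reads is what lets the accumulator approach peak SSD bandwidth despite fine-grained, per-node access patterns.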

Architecture for GNN training using DGL mmap

DGL mmap (memory-mapped I/O in the Deep Graph Library) is a traditional, CPU-driven approach to graph data loading. Compared with GIDS, it carries performance tradeoffs, especially in large-scale GPU workloads.

Figure 2. Deep Graph Library loader

Key characteristics of DGL mmap: 

  1. Memory-mapped file access: Uses OS-level mmap to map graph data (nodes, edges, features) into virtual memory.
  2. CPU-orchestrated data movement: The CPU handles all data access and transfers to the GPU via PCIe.
  3. No GPU-aware caching: Lacks predictive caching or buffering mechanisms.
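The CPU-driven pattern above can be sketched with Python's standard-library mmap. This is a simplification of what DGL does internally, not DGL code; the file name, node count, and feature width are made up for illustration.

```python
# Minimal sketch of the CPU-driven mmap pattern behind the DGL loader
# (simplified; DGL wraps this in its own dataset classes). File name,
# feature width, and node count are hypothetical.
import mmap
import struct

N_NODES, FEAT_DIM = 1000, 8
REC_BYTES = FEAT_DIM * 4  # float32 features

# Write a fake feature file: node i holds [i, i, ..., i].
with open("features.bin", "wb") as f:
    for i in range(N_NODES):
        f.write(struct.pack(f"{FEAT_DIM}f", *([float(i)] * FEAT_DIM)))

# OS-level mmap: the file is mapped into virtual memory, and the CPU
# faults pages in on access -- no GPU-aware caching or prefetching.
with open("features.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    batch_ids = [3, 42, 7, 999]
    # The CPU gathers each row, then would push the batch to the GPU
    # over PCIe (e.g., via a framework tensor copy).
    batch = [struct.unpack(f"{FEAT_DIM}f",
                           mm[i * REC_BYTES:(i + 1) * REC_BYTES])
             for i in batch_ids]
    mm.close()

print([row[0] for row in batch])  # → [3.0, 42.0, 7.0, 999.0]
```

Every gather and transfer in this path consumes CPU cycles and serializes behind the host, which is exactly the bottleneck GIDS removes by letting the GPU issue the reads itself.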

Testing methodology

We compare the end-to-end time (100 steps) for data loading efficiency using GIDS' GPU-centric architecture with Solidigm D7-PS1010 SSDs versus DGL mmap. We tested both the small and full heterogeneous datasets from the Illinois Graph Benchmark (IGBH) for training the GNN.
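The measurement itself is a simple wall-clock harness, sketched below. The step body here is a stand-in; in the real runs each step samples neighbors, loads features (via GIDS or DGL mmap), and trains the GNN.

```python
# Sketch of the end-to-end timing harness: wall-clock seconds over 100
# training steps. The dummy step body is a placeholder for the real
# sample-load-train loop.
import time

N_STEPS = 100

def run_steps(step_fn, n_steps=N_STEPS):
    """Return wall-clock seconds for n_steps calls to step_fn."""
    start = time.perf_counter()
    for step in range(n_steps):
        step_fn(step)
    return time.perf_counter() - start

elapsed = run_steps(lambda step: None)  # dummy step body
print(f"{N_STEPS} steps took {elapsed:.6f} s")
```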

1. System configuration

Software:

  1. OS: Ubuntu 20.04.5 LTS
  2. Kernel: 5.8.0
  3. CUDA 12.9
  4. BaM 
  5. GIDS dataloader
  6. Python Deep Graph Library (DGL)

Hardware:

  1. Server: Supermicro SYS-421GE-TNRT  
  2. CPU: Intel Xeon Silver 4516Y
  3. GPU: Nvidia L40S
  4. SSD: Solidigm™ D7-PS1010 (E1.S, 3.84TB, PCIe 5.0 air-cooled)

2. Benchmark parameters

  • Dataset: IGB Heterogeneous Small (10M Edges), IGB Heterogeneous Full (5.8B Edges)
  • Baseline: DGL mmap (CPU-centric dataloader)
  • Test tool: GIDS dataloader
  • Metrics: End-to-end runtime (seconds) for 100 steps during training
  • Each step samples neighbors for a batch of 512 seed nodes across 2 layers
  • GNN model: Relational Graph Attention Network (RGAT)
  • Window buffering: 8GB
  • Constant CPU buffer: 20% of dataset size
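A back-of-the-envelope check of the cache parameters above: the 8GB window buffer and the 20% CPU-buffer ratio come from the configuration, while the feature-store size and feature width below are hypothetical placeholders, not measured values.

```python
# Sizing the caches from the benchmark parameters. The 8 GB window
# buffer and 20% CPU-buffer ratio are from the configuration above;
# the feature-store size and feature width are hypothetical.
GIB = 1024 ** 3

feature_store_bytes = 300 * GIB                      # hypothetical feature store size
window_buffer_bytes = 8 * GIB                        # GPU window buffer (from config)
cpu_buffer_bytes = int(0.20 * feature_store_bytes)   # constant CPU buffer = 20%

feat_bytes = 1024 * 4                                # e.g. 1024-dim float32 (assumption)
vectors_in_window = window_buffer_bytes // feat_bytes

print(f"CPU buffer: {cpu_buffer_bytes / GIB:.0f} GiB")               # → 60 GiB
print(f"Window buffer holds {vectors_in_window:,} feature vectors")  # → 2,097,152
```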

The benchmark accuracy target was set to 70% during training to align with the MLPerf target for IGBH. Note: MLPerf is an industry-standard benchmark suite for evaluating machine learning performance across diverse workloads, and this threshold indicates that the model has reached a meaningful level of convergence and quality.

We used publicly accessible code repositories to enable and run GIDS and BaM, with datasets from the Illinois Graph Benchmark suite. See appendix for links.

3. Results

Accuracy of model after training with IGBH small dataset (10M Edges)

The training accuracy for both data loaders meets the baseline threshold of 70%, validating the trained model.

Figure 3. Accuracy of model after training with IGBH small dataset (10M Edges)

E2E time for IGBH small dataset (10M Edges) using a single Solidigm D7-PS1010 drive for 100 steps

When using the Solidigm D7-PS1010 PCIe 5.0 SSD, GIDS cuts load time by almost 2x for small graphs by minimizing overhead and utilizing parallelism.

Figure 4. E2E time for IGBH small dataset (10M Edges)

Accuracy of model after training with IGBH full dataset (5.8B Edges)

The training accuracy for both data loaders meets the baseline threshold of 70%, validating the trained model.

Figure 5. Accuracy of model after training with IGBH full dataset (5.8B Edges)

E2E time for IGBH full dataset (5.8B Edges) using a single Solidigm D7-PS1010 drive for 100 steps

GIDS scales with increasing edge counts, running almost 9x faster than DGL mmap on the full dataset.

Figure 6. E2E time for IGBH full dataset (5.8B Edges)

GPU cache metrics achieved when using BaM and GIDS with IGBH dataset

Dataset   Cache Hit Rate   Cache Miss Rate
Small     77.49%           22.51%
Full      10.80%           89.20%

Table 1. GPU cache metrics achieved when using BaM and GIDS with IGBH dataset

Table 1 compares GPU cache performance metrics when training a GNN using the GIDS data loader on two dataset scales (Small vs Full) of the IGBH heterogeneous graph. The metrics include:

  • Cache Hit Rate: Percentage of feature fetch requests served from the software cache (a multi-layered cache comprising the GPU window buffer and constant CPU buffer) rather than from the SSD under BaM.
  • Cache Miss Rate: Percentage of requests that require fetching from the SSD under BaM because the data was not present in the software cache.

Cache hit and miss rates differ drastically between the small and full datasets (roughly 77% hits for the small run versus 11% for the full run). However, this does not translate into a correspondingly large end-to-end time gap, because GPU compute dominates the training loop and overlaps with I/O. Since data is accessed through BaM with coalescing and prefetching, SSD reads are serviced efficiently during computation, reducing the impact of cache misses on overall runtime.
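A simple expected-latency model makes this concrete. The hit rates are taken from Table 1, but the per-access latencies below are hypothetical round numbers chosen for illustration, not measurements from our testing.

```python
# Illustrative model of why the full dataset's low hit rate need not
# blow up end-to-end time. Per-access latencies are hypothetical;
# hit rates come from Table 1.
def avg_fetch_us(hit_rate, cache_us=2.0, ssd_us=100.0):
    """Expected per-request latency for a given cache hit rate."""
    return hit_rate * cache_us + (1.0 - hit_rate) * ssd_us

small = avg_fetch_us(0.7749)  # IGBH-small hit rate
full = avg_fetch_us(0.1080)   # IGBH-full hit rate
print(f"small: {small:.1f} us/request, full: {full:.1f} us/request")

# The resulting ~3.7x gap in average fetch cost is largely hidden when
# fetches overlap with GPU compute, which dominates each training step.
```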

The DGL mmap method is not included in the table because it has no mechanisms for GPU software caching or SSD-aware prefetching, and therefore no software cache accesses to report.

Conclusion and Future Directions

The Solidigm D7-PS1010 SSD unlocks performance gains when paired with BaM and GIDS technologies for GNN training. 

  • Utilizing the fast sequential and random read capabilities of Solidigm D7-PS1010 SSDs, integration with BaM and GIDS accelerates the ingestion and preparation of large-scale graph datasets.
  • GIDS significantly reduces data load times, achieving up to 2x faster loading for the IGBH-small dataset (10M edges) compared to the traditional CPU-bound DGL mmap approach.

For the largest evaluated dataset, IGBH-full (5.8 billion edges), GIDS maintains high-speed data loading that is 9 times faster than legacy methods. Our work on BaM and GIDS lays a solid foundation for the readiness of Solidigm D7-PS1010 SSDs in future storage configurations.


About the Authors

Ashwin Pai is a System Validation Engineer at Solidigm, with nearly a decade of experience in software, hardware, and systems engineering. He focuses on validating next-generation SSD technologies across diverse platforms, including those optimized for AI and data-intensive workloads. Ashwin collaborates across cross-functional teams utilizing advanced AI methodologies and breakthrough innovations to enhance the capabilities of Solidigm SSDs in AI-driven environments. He holds a Bachelor of Engineering in Electronics from VES Institute of Technology and an M.S. in Computer Engineering from North Carolina State University.

Akhil Srinivas is an Electrical & Systems Engineer at Solidigm. He collaborates with industry-leading ecosystem vendors to validate Solidigm SSDs for cutting-edge storage solutions. He leverages emerging AI technologies and pathfinding innovations to position Solidigm SSDs as critical components in next-generation platforms, strengthening partnerships in the AI space. Beyond the enterprise, he indulges in culinary adventures, exploring popular food trucks and restaurants across the country. Akhil holds a Bachelor of Telecommunications Engineering from R.V. College of Engineering and an M.S. in Electrical and Computer Engineering from University of California, Davis.

About the Solidigm D7-PS1010 SSD E1.S form factor

The Solidigm D7-PS1010 E1.S is the leading SSD for AI workloads. The E1.S D7-PS1010 is offered in the following configurations targeted for specific AI server environments:

Form Factor   3.84TB                        7.68TB
9.5mm         Liquid cooled or air cooled   Liquid cooled or air cooled
15mm          Air cooled only               Air cooled only

For more details about the Solidigm 3.84TB E1.S D7-PS1010, please visit:

https://www.solidigm.com/E1.S-D7-PS1010-3.84TB

Appendix

We referred to the following links for BaM and GIDS tools, scripts, and datasets.

  1. https://github.com/jeongminpark417/GIDS
  2. https://github.com/ZaidQureshi/bam
  3. https://github.com/IllinoisGraphBenchmark/IGB-Datasets

Disclaimers

©2026, Solidigm. “Solidigm” is a registered trademark of SK hynix NAND Product Solutions Corp. (d/b/a Solidigm) in the United States, People’s Republic of China, Singapore, Japan, the European Union, the United Kingdom, Mexico, and other countries.

Other names and brands may be claimed as the property of others. 

Solidigm may make changes to specifications and product descriptions at any time, without notice. 

Tests document the performance of components on a particular test, in specific systems. 

Differences in hardware, software, or configuration will affect actual performance.

 Consult other sources of information to evaluate performance as you consider your purchase. 

These results are preliminary and provided for information purposes only. These values and claims are neither final nor official. 

Drives are considered engineering samples. Refer to roadmap for production guidance.