DPU-Native Distributed Flexible Storage Solution

The rise of high-density QLC SSDs in AI infrastructure

Data center rack with DPU-native distributed flexible storage solution using Solidigm D7-PS1010 SSDs and CSAL.

As artificial intelligence permeates our world, it is unlocking insights from archival data that, until now, has remained largely untapped. As a result, storage tiers once considered "cool" or dormant are now "warming up," sparking a new race to harness and learn from this vast repository of historical data. Traditionally, hard disk drives (HDDs) have served as the backbone of archival storage, capably handling the requirements of these colder tiers. However, as data is increasingly accessed to fuel AI models and GPU workloads, the limitations of HDDs have become apparent. 

HDD performance can no longer keep pace with the demands of rapid data extraction and delivery. Furthermore, while HDDs have achieved impressive gains in data density, they still fall short of high-density Quad-Level Cell (QLC) SSDs, which are capturing an expanding share of the market. QLC SSDs not only offer greater storage density but also significantly reduce rack footprint and enhance power efficiency compared to their HDD counterparts. This technological advantage is prompting a noticeable industry shift toward high-density QLC drives.

Solid-state storage has undergone a remarkable transformation, evolving from SLC (Single-Level Cell) to MLC (Multi-Level Cell), TLC (Triple-Level Cell), and now QLC architectures, each successive leap allowing a single memory cell to store more bits of data. The transition from 2D planar NAND to 3D stacked NAND has further amplified storage capabilities. Today, NAND chip stacks commonly surpass 300 layers in height. In the enterprise space, mainstream single-disk SSDs now deliver capacities of 61.44TB and 122.88TB, with even larger drives on the horizon ushering in a new era of high-performance, high-capacity archival storage.

QLC advantage over hybrid storage solution

Figure 1 shows how an all-QLC storage solution compares to a TLC+HDD solution when solving for a 50 MW power budget with 16PB of Network Attached Storage (NAS).

Figure 1. Impact of storage device type on rack space and power consumption

Compared to traditional TLC-and-HDD hybrid architectures, an all-QLC high-capacity array reduces storage power consumption by approximately 90% while providing the same storage capacity. The number of racks required is also notably reduced, offering 8x space compaction. Fewer disk drives reduce the overall weight of the storage equipment as well, substantially lowering the load-bearing requirements for the data center floor. QLC architectures also significantly increase interface bandwidth per terabyte of storage. For details and assumptions, refer to Appendix A.

| Performance Parameters | 30TB Exos M HDD | 122.88TB Solidigm™ D5-P5336 QLC |
|---|---|---|
| Sequential Write | 262 MB/s | 3,000 MB/s |
| Sequential Read | 275 MB/s | 7,000 MB/s |
| Random Write | 350 IOPS | 19,000 IOPS |
| Random Read | 170 IOPS | 900,000 IOPS |

Table 1. Performance comparison between HDD and Solidigm D5-P5336 QLC SSD

Table 1 compares the read/write performance of 30TB HAMR HDDs¹ and Solidigm D5-P5336 122.88TB QLC SSDs.² QLC SSDs, in general, exhibit significantly higher read/write speeds than HDDs and offer much higher densities. For modern AI data centers and cloud data centers that deploy petabyte-scale storage, the bandwidth per terabyte offered by an all-QLC flash array is far greater than that of an HDD-based array. This means that reading the same size data block takes considerably less time with a full-flash array, resulting in storage infrastructure that can keep pace with the GPUs and reduce GPU idle time.
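As a rough illustration of that bandwidth-per-terabyte gap, the Table 1 numbers can be compared directly (a back-of-the-envelope sketch; real arrays add controller and network overheads):

```python
# Rough bandwidth-per-terabyte comparison using the Table 1 figures.
hdd_read_mbps, hdd_capacity_tb = 275, 30        # 30TB Exos M HDD
ssd_read_mbps, ssd_capacity_tb = 7000, 122.88   # Solidigm D5-P5336 QLC SSD

hdd_bw_per_tb = hdd_read_mbps / hdd_capacity_tb   # ~9.2 MB/s per TB
ssd_bw_per_tb = ssd_read_mbps / ssd_capacity_tb   # ~57 MB/s per TB

print(f"HDD: {hdd_bw_per_tb:.1f} MB/s per TB")
print(f"QLC SSD: {ssd_bw_per_tb:.1f} MB/s per TB")
print(f"QLC advantage: {ssd_bw_per_tb / hdd_bw_per_tb:.1f}x")
```

Even though a single D5-P5336 holds roughly four times the data of the HDD, it still delivers several times more sequential-read bandwidth per terabyte stored.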

High-capacity SSD DRAM size challenges

QLC SSDs offer numerous advantages. However, as QLC capacities grow, the Indirection Unit (IU) size is also growing, evolving from the traditional 4K to 16K and 64K. This is because the DRAM size in enterprise-grade SSD controllers is limited by the power budget, and DRAM is typically used to store the Flash Translation Layer (FTL) table, which maps logical addresses to physical addresses. As disk capacity grows, the number of FTL entries increases, requiring more DRAM space. This poses challenges in terms of cost and implementation. Consequently, many manufacturers have opted to increase the IU size to reduce the number of FTL entries, thereby keeping the effective DRAM capacity used by the drives similar to that of their lower-capacity SSD counterparts.
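To see why larger IUs relieve DRAM pressure, consider a back-of-the-envelope FTL sizing. The 4-byte entry width below is an assumption for illustration; real controllers vary in entry format and metadata overhead:

```python
# Approximate FTL (L2P) table size for a 122.88TB drive at different IU sizes.
# Assumes a flat table with 4 bytes per entry; real controllers differ.
capacity_bytes = 122.88e12
entry_bytes = 4  # assumed L2P entry width

for iu_kib in (4, 16, 64):
    entries = capacity_bytes / (iu_kib * 1024)   # one entry per IU
    table_gib = entries * entry_bytes / 2**30
    print(f"IU {iu_kib:>2}K -> {entries:.2e} entries, ~{table_gib:.0f} GiB DRAM")
```

Under these assumptions, a 4K IU would demand on the order of 100 GiB of controller DRAM for a 122.88TB drive, while a 64K IU brings the table down to single-digit GiB, which is why large-capacity drives move to larger IUs.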

QLC SSD IU sizes have been increasing, but the software stack has not adapted to this change: it still operates on the principles of traditional storage designed around smaller indirection units. This mismatch introduces the potential for a higher Write Amplification Factor (WAF) and, consequently, negatively impacts the lifespan of QLC SSDs. For instance, if an application writes a 4K data block to an SSD with a 64K IU size, the actual amount of data written is 64K, resulting in a WAF of 64/4 = 16 for small writes. Since OS memory pages are usually 4K in size, without optimization each 4K write amplifies the data volume by 16x, hurting both endurance and speed.
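The write-amplification arithmetic above generalizes as follows (a simplified model that ignores drive-internal garbage collection):

```python
# Simplified WAF for host writes smaller than the indirection unit (IU):
# every write of `write_kib` forces the drive to rewrite a full IU.
def small_write_waf(write_kib: int, iu_kib: int) -> float:
    """Bytes physically written divided by bytes the host asked to write."""
    return max(iu_kib, write_kib) / write_kib

print(small_write_waf(4, 64))   # 16.0 — the 4K-on-64K-IU example from the text
print(small_write_waf(4, 16))   # 4.0
print(small_write_waf(64, 64))  # 1.0 — IU-aligned writes avoid amplification
```

The last case is the goal of IU alignment: once writes reach the drive in IU-sized units, the small-write amplification term disappears.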

Storage software vendors have been optimizing their storage stack to align with the IU size of large-IU SSDs. To make this capability available to small and medium-sized enterprises (SMEs), Solidigm has open-sourced the write-shaping feature via Cloud Storage Acceleration Layer (CSAL), aiming to accelerate the implementation of high-capacity QLC SSDs.

CSAL host-based FTL for IU alignment

CSAL is a host-based Flash Translation Layer designed to optimize the performance and endurance of high-density NAND flash SSDs, particularly in mixed workload scenarios. 

Figure 2. System architecture of the CSAL software

As shown in Figure 2, CSAL employs a tiered storage architecture, utilizing high-performance storage media (such as Solidigm D7-PS1010) as a front-end cache and write buffer. It converts small random write operations from the host into large sequential write operations for the underlying high-capacity QLC SSDs. During this process, IU alignment is achieved by aggregating small writes from the application layer into block sizes equal to or larger than the indirection unit of the QLC SSDs. This conversion significantly reduces WAF, extends flash memory lifespan, and improves throughput while reducing latency.
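The write-shaping idea can be sketched in a few lines: buffer incoming small writes until a full IU-sized chunk accumulates, then flush it as one large sequential write. This is a toy model of the concept, not CSAL's actual implementation:

```python
# Toy model of CSAL-style write shaping: aggregate small random writes
# into IU-aligned sequential writes before they reach the QLC tier.
IU_BYTES = 64 * 1024  # 64K indirection unit (assumed)

class WriteShaper:
    def __init__(self, iu_bytes: int = IU_BYTES):
        self.iu_bytes = iu_bytes
        self.buffer = bytearray()   # stands in for the cache / write buffer
        self.qlc_writes = []        # each entry is one large sequential write

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        # Flush only full IU-sized chunks; any remainder stays buffered.
        while len(self.buffer) >= self.iu_bytes:
            chunk = self.buffer[:self.iu_bytes]
            self.buffer = self.buffer[self.iu_bytes:]
            self.qlc_writes.append(bytes(chunk))

shaper = WriteShaper()
for _ in range(32):                 # thirty-two 4K writes = two full 64K IUs
    shaper.write(b"x" * 4096)
print(len(shaper.qlc_writes))       # 2 IU-aligned writes reach the QLC tier
```

In the real system the buffered data is persisted on the high-performance cache tier rather than in volatile memory, so small writes are crash-safe before they are destaged to QLC.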

Read operations can be served either from the cache or from QLC, leveraging the TLC-equivalent read bandwidth offered by the QLC SSDs.

Storage requirements vary significantly across the different stages of the AI pipeline. For example, during the AI training process, a large amount of temporary data, such as checkpoint files, is frequently saved. During this phase, storage devices need to handle a large volume of sequential write operations of large blocks of data, a scenario where QLC SSDs are a good fit. However, in the AI inference stage, Retrieval Augmented Generation (RAG) is widely adopted, which leads to vector databases being heavily used. At this point, storage devices need to handle many small, random write operations.

To better serve diverse application I/O sizes and frequent changes in data I/O patterns, we are introducing a DPU-native flexible disaggregated storage solution with CSAL.

DPU-native flexible disaggregated storage solution with CSAL

Figure 3. CSAL Flexible Storage Solution

In Figure 3, we show how CSAL is offloaded to the DPU to achieve a flexible storage solution. This sample solution shows storage and compute disaggregation with separate storage nodes for cache drives and for capacity drives. 

  1. Storage nodes are exposed as NVMe-oF targets through the DPU. 
  2. CSAL software is installed on the DPU using DPU APIs.
  3. CSAL utilizes DPU resources like DPU DRAM to store logical-to-physical (L2P) mapping and DPU cores to run the FTL.
  4. DPU offloading frees the host DRAM and CPU resources otherwise consumed by storage processing. 
  5. CSAL abstracts the cache and capacity drives from the host and exposes them to the application as remote storage via NVMe-oF. 
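Under illustrative assumptions (the class and field names below are not CSAL's API), the layout these steps describe can be modeled like this — the L2P map lives in DPU DRAM, and the host only ever sees a remote NVMe-oF volume:

```python
from dataclasses import dataclass, field

@dataclass
class NvmeOfTarget:             # a drive exported by a storage-node DPU
    nqn: str                    # NVMe Qualified Name (illustrative values)
    capacity_tb: float
    media: str                  # "tlc-cache" or "qlc-capacity"

@dataclass
class DpuFtl:                   # CSAL running on the compute-node DPU
    l2p: dict = field(default_factory=dict)    # held in DPU DRAM, not host DRAM
    targets: list = field(default_factory=list)

    def expose_virtual_disk(self) -> dict:
        """What the host application sees: one remote NVMe-oF volume."""
        return {
            "capacity_tb": sum(t.capacity_tb for t in self.targets
                               if t.media == "qlc-capacity"),
            "transport": "nvme-of",
        }

ftl = DpuFtl(targets=[
    NvmeOfTarget("nqn.2025-01.example:cache0", 16, "tlc-cache"),
    NvmeOfTarget("nqn.2025-01.example:qlc0", 122.88, "qlc-capacity"),
    NvmeOfTarget("nqn.2025-01.example:qlc1", 122.88, "qlc-capacity"),
])
print(ftl.expose_virtual_disk())
```

The key point the model captures is separation of concerns: the cache target is an internal resource of the FTL, so only the aggregated QLC capacity is advertised to the host.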

This section discusses the components of this solution further.

1. Data processing unit (DPU)

The compute resources on the DPU card can effectively support the storage computation needs of lightweight CSAL software, such as FTL’s L2P table lookups and user data block merging. Additionally, DRAM resources on the DPU card can be used to store a portion of the FTL table’s contents. By utilizing the resources on the DPU card, CSAL successfully reduces the occupation of the host CPU and memory, allowing them to be better utilized for business-related services. Furthermore, if high-end DPUs are equipped with larger DRAM memory resources available for storage processing, we can modify the configuration options within the CSAL software to use DPU DRAM instead of high-performance cache disks to fulfill the caching functionality within the software. This design offers flexibility in memory utilization and configuration.

2. Storage node

The storage nodes are composed of all-flash arrays with solid-state drives, which can be a mix of high-speed TLC cache disks and high-density QLC capacity disks. Each storage node will have several DPU cards. These provide NVMe-oF based targets to the host. The storage nodes adopt a multi-path access mode to connect to the storage network. This not only improves data I/O bandwidth but also prevents single points of failure.

3. Compute node

Typically, compute node servers are equipped with a large number of GPUs and DPUs to meet high computational demands. DPU infrastructure is gaining adoption, though it is still in the early stages. While DPU processing power is underutilized today, many applications are evolving to take advantage of the processing offered by the DPU to offload host compute. CSAL does exactly that: we moved the CSAL software module from the compute node's CPU to the DPU card. This allows for faster data processing during network transmission and maximizes the utilization of DPU resources that were otherwise idle.

4. Software-defined flexible storage

Empowered by DPU hardware and leveraging the NVMe-oF protocol, CSAL can span from a single-machine to a flexible, cross-network storage solution. All-flash array (“just a box of flash” or “JBOF”) storage nodes provide several NVMe-oF targets over the network. Data center management software can flexibly configure storage capacity and features according to the different stages and requirements of artificial intelligence computing.

Examples of CSAL leveraged for the AI pipeline

In this section, we will go through a few examples to illustrate how the flexibility of CSAL storage can be leveraged at various stages of the AI pipeline. Each AI computing stage has different I/O patterns associated with it. In the examples below, we show how CSAL can be used to create diverse volumes supporting the requirements of various stages of the AI pipeline.

Example 1

In this example, we’ll assume that CSAL FTL is deployed on Solidigm 122TB QLC SSDs. If a compute node requires a 200TB storage space for the data ingest and checkpointing phases, the management software can instruct CSAL to create a virtual disk composed of two NVMe-oF targets on the compute node's DPU. Since these stages involve large sequential writes, cache disks are optional. DRAM in the DPU can be utilized to merge the small fraction of 4K data blocks issued by the filesystem into large sequential blocks before writing them directly to the QLC high-capacity disks.

Example 2

In this example, let’s assume that the compute node is an inference server. The inference phase has I/O patterns consisting of mixed read and write accesses that are small and random in nature. Retrieval Augmented Generation (RAG) is one inference technique, and vector databases are a core component of RAG systems. To create 200TB of storage space for a vector database, the management software can create a CSAL virtual disk on the compute node's DPU consisting of a 16TB cache target and two 122TB QLC targets. Since the I/O pattern of the vector database involves small-block random reads and writes, the high-performance target used for caching aligns the small I/O to the indirection unit and merges it into large data blocks before writing to the subsequent QLC targets.
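Both examples follow the same provisioning logic: select enough QLC targets to meet the requested capacity, and add a cache target only when the workload is small and random. A hypothetical management-software sketch (target names, sizes, and workload labels are illustrative, not a real CSAL interface):

```python
import math

QLC_TARGET_TB = 122.88   # one high-capacity QLC drive exported over NVMe-oF
CACHE_TARGET_TB = 16     # high-performance cache target (illustrative size)

def provision(capacity_tb: float, workload: str) -> dict:
    """Compose a CSAL virtual disk on the compute node's DPU."""
    qlc_targets = math.ceil(capacity_tb / QLC_TARGET_TB)
    # Large sequential writes (ingest/checkpointing) can skip the cache tier;
    # small random I/O (e.g. a RAG vector database) gets a cache target.
    use_cache = workload == "small-random"
    return {
        "qlc_targets": qlc_targets,
        "cache_targets": 1 if use_cache else 0,
        "raw_capacity_tb": qlc_targets * QLC_TARGET_TB,
    }

print(provision(200, "large-sequential"))  # Example 1: 2 QLC targets, no cache
print(provision(200, "small-random"))      # Example 2: 2 QLC targets + 1 cache
```

For a 200TB request, two 122.88TB targets give 245.76TB of raw capacity in both cases; only the caching decision differs between the two workloads.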

By introducing DPU and CSAL software, data centers can utilize all storage disks more efficiently and can flexibly configure, create, and release storage resources as needed.

Test topology

To demonstrate the performance improvements achieved through I/O alignment to the indirection unit of QLC SSDs and the flexible storage solution with CSAL based on DPU, we set up the following test environment in our lab, as shown in Figure 4.

Figure 4. CSAL+DPU Testbed

In Figure 4, ① and ② are high-capacity QLC solid-state drives, and ③ is a high-performance TLC cache disk. ④ and ⑤ are the virtual NVMe-oF disks created by combining the underlying physical disks through the DPU network and network switch.

In the test environment, the DPU on the storage node is used to transform the solid-state drives into NVMe targets, which are then exposed to the compute node via the network. For comparison testing, we mapped ① (the high-capacity QLC solid-state drive) as a standalone virtual disk ④; and mapped ② and ③ into a virtual disk ⑤ using CSAL software. We then used the common storage stress test software FIO to conduct storage performance tests on both virtual disks.

During the testing process, we varied the write data block size, write queue depth, and write pattern in different combinations to address the block size requirements of various artificial intelligence stages. For example, the data preparation and checkpointing processes involve large data block writes, corresponding to test #1, while vector database and write-ahead logging applications involve frequent 4K random data writes, corresponding to test #4. The test results are shown in Figure 5 below.
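As one concrete illustration, a fio job approximating the 4K random-write pattern of test #4 might look like the following; the parameters and device path are illustrative, not the exact job files used in our runs:

```ini
; fio job approximating the 4K random-write pattern (vector DB / WAL style).
; /dev/nvme1n1 is a placeholder for the CSAL NVMe-oF virtual disk.
[global]
ioengine=libaio
direct=1
time_based=1
runtime=300
group_reporting=1

[randwrite-4k]
filename=/dev/nvme1n1
rw=randwrite
bs=4k
iodepth=32
numjobs=4
```

Swapping `rw=randwrite` for `rw=write` and raising `bs` to a large block size yields the sequential large-block pattern of test #1.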

Test results

Figure 5. Test Data Comparison

Based on the aforementioned test environment, we conducted five performance tests on both a standalone high-capacity QLC disk and the CSAL virtual disk composed of a high-speed cache disk and a high-capacity QLC disk. The I/O patterns of these five tests correspond to several different stages in artificial intelligence computing:

  • Data cleaning

  • Checkpoints during training

  • Object storage

  • Write-ahead logging

  • Data preparation 

  • RAG

The test results show that for small 4K writes, whether random or sequential, the CSAL virtual disk demonstrates approximately a 20x performance improvement compared to the standalone QLC SSD. In the non-uniform random-distribution (Zipf) test, the caching effect on hot data gives the CSAL solution a 15x performance improvement over the standalone QLC solid-state drive.

It is important to note that the results with standalone QLC SSDs are still far superior compared to using HDDs since Gen4 QLC SSDs have more than 10x the write performance as that of an HDD. Our testing shows the importance of using a modern storage software solution like CSAL that can take advantage of the underlying raw performance of solid-state storage to meet the demands of the AI infrastructure while allowing for storage capacity scaling and flexibility of configuration at the same time.

Conclusion

As AI applications continue to proliferate, rapidly delivering large amounts of data to compute makes fast, flexible storage a crucial component of AI infrastructure. The latest AI infrastructure proposes storage-compute separation, and inference applications propose prefill-decode (PD) separation, both of which place high demands on storage flexibility.

Our results demonstrate that in AI inference applications, the combination of CSAL and DPU can seamlessly support the storage performance and capacity requirements of object storage and vector databases. The CSAL plus DPU solution aims to enable more flexible allocation of storage resources on demand in large-scale data centers, maximizing customers' investment in network and storage. 


About the Author

Wayne Gao is a Principal Engineer and Solution Storage Architect at Solidigm. He has worked on Solidigm’s Cloud Storage Acceleration Layer (CSAL) from pathfinding to commercial release. Wayne has over 20 years of storage developer experience, has four U.S. patent filings/grants, and is a published EuroSys paper author.

Li Bo serves as a senior storage solutions architect at Solidigm. With over two decades of experience in system design and development across multiple organizations, he specializes in optimizing the performance of networked and storage solutions. In recent years, he has concentrated his efforts on advancing the industry-wide adoption of non-volatile storage technologies.

Sarika Mehta is a Senior Storage Solutions Architect at Solidigm, bringing over 16 years of experience from her tenure at Intel’s storage division and Solidigm. Her focus is to work closely with Solidigm customers and partners to optimize their storage solutions for cost and performance. She is responsible for tuning and optimizing Solidigm’s SSDs for various storage use cases in a variety of storage deployments that range from direct-attached storage to tiered and non-tiered disaggregated storage solutions. She has diverse storage background in validation, performance benchmarking, pathfinding, technical marketing, and solutions architecture.

Notes

1. https://www.seagate.com/content/dam/seagate/en/content-fragments/products/datasheets/exos-m-v2/exos-m-v2-DS2045-3-2506-en_US.pdf

2. https://www.solidigm.com/products/data-center/d5/p5336.html#form=U.2%2015mm&cap=122.88TB

Appendix A: QLC power efficiency versus HDD

Source – Solidigm, October 2024. Other information resources as noted below.

Scope: Power consumption analysis assumes a green field (new) bottom-range Hyper-scaler / Tier 2 AI data center implementation utilizing leading-edge power and space optimizations. Key model parameters as follows:

  • Assume 50MW of total available data center energy capacity

Appendix B: Test system configuration

| Component | Details |
|---|---|
| System | Manufacturer: AMD; Product Name: AMD EPYC 7542 32-Core Processor; BIOS Vendor: AMD Corporation |
| CPU | AMD EPYC 7542 32-Core Processor, 1 socket @ 2.9GHz, 32 cores per socket |
| NUMA Nodes | 4 |
| DRAM | 440GiB total, DDR4 @ 2667 MT/s |
| OS | Rocky Linux 9.6 |
| Kernel | 5.14.0-503.14.1.el9_5.x86_64 |
| SSDs | 4 x Solidigm D5-P5336 15.36TB, FW Rev: 5CV10302, PCIe Gen4 x4; 4 x Solidigm D7-PS1010 3.84TB, FW Rev: G70YG030, PCIe Gen5 x4 |
| FIO | Version 3.39 |
| DPU | MT43244 BlueField-3 |

Table 2. System configuration