Platform Optimization for Performance and Endurance

Solidigm QLC SSDs + Wiwynn platform with Cloud Storage Acceleration Layer (CSAL)

Background

Cloud Storage Acceleration Layer (CSAL) has been in production for several years now.1 It was deployed to accelerate the adoption of high-density Solidigm QLC SSDs into customer solutions. Solidigm high-density QLC drives come with indirection unit sizes larger than the typical 4KiB. Over the years, the host storage stack has been optimized around this 4KiB indirection unit. When indirection-unit-aligned host IO is written to NAND media in the SSD, the firmware writes the block without any additional operations. When it is unaligned, NAND physics requires the firmware to perform a read-modify-write. This can cause multiple blocks to be read, modified, and written back, and the overhead is expressed as the write amplification factor (WAF). This problem exists even with 4KiB indirection drives but becomes more pronounced with larger indirection unit sizes such as 8KiB, 16KiB, or 64KiB, because the host stack is optimized around 4KiB.
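To make the read-modify-write penalty concrete, consider a simplified example (ignoring garbage collection) of a hypothetical drive with a 16KiB indirection unit receiving a single unaligned 4KiB host write. The firmware must read the 16KiB block, merge the new 4KiB, and program the full 16KiB back to NAND:

WAF = NAND bytes written / host bytes written = 16KiB / 4KiB = 4

The same 4KiB write on a drive with a 4KiB indirection unit lands as one aligned program, keeping WAF near 1.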

While Solidigm is a leader in QLC SSD development and is on its fourth generation of mature QLC technology with drive capacities up to 122.88TB,2 other SSD vendors are also entering this market. The storage industry is facing a unique challenge: as SSD capacities grow, so does the DRAM on the SSD needed to maintain the Flash Translation Layer (FTL), because the FTL size scales with SSD capacity. This would be fine in a world without power budgets and price sensitivity. However, added DRAM increases SSD price and power consumption, and the impact grows as SSD capacity increases. To contain the FTL size, SSD vendors use a larger indirection unit, which limits the FTL size, helps meet power requirements, and keeps DRAM cost in check. It is an astonishing feat to provide 8x more capacity, 122.88TB, in the same power budget and form factor as a 15.36TB SSD.
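A rough estimate shows why: assuming a flat FTL map with 4-byte entries (actual FTL designs vary), the map size is approximately the drive capacity divided by the indirection unit size, multiplied by the entry size. For a 122.88TB drive with a 4KiB indirection unit that is roughly 30 billion entries, or about 120GB of DRAM; moving to a 16KiB indirection unit cuts it to roughly 30GB, and larger indirection units reduce it further.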

The WAF challenge

This brings us to the WAF challenge. WAF hurts drive performance, because each host IO results in many NAND IOs in the background, and it degrades drive endurance. One of the features that CSAL offers is a host-based FTL that leverages faster media, such as Intel® Optane™ SSDs or SLC or TLC SSDs, for caching host IO. This IO is later written to the QLC drives in large, sequential, indirection-unit-friendly IO sizes. Using a higher performance caching device allows higher host write performance. Additionally, CSAL maintains a WAF very close to 1 on the QLC drives, a result of issuing large sequential writes to the QLC SSDs in the storage tier, which enhances their endurance. As a result, the customer gains the ability to achieve the desired performance, capacity, and endurance from their storage. CSAL provides a unique solution that satisfies the need for dense capacity with lower power consumption without having to modify the host stack.

By incorporating CSAL in the host stack, the host can leverage the flexibility offered by CSAL to tune hardware to meet application requirements. As Intel phased out the Optane media that was used for caching in many deployments, many customers needed an alternative. Solidigm bridged this gap by offering the Gen4 SLC Solidigm™ D7-P5810 SSD. As Gen5 SSDs become mainstream, leveraging Gen5 bandwidth for write shaping and host IO caching presents an interesting option.

Solidigm teamed up with Wiwynn to explore the use of the Solidigm D7-PS1010 Gen5 TLC SSD as the cache device along with the Solidigm D5-P5336 as the storage device in a CSAL deployment on a Wiwynn platform. Wiwynn is an IT infrastructure provider for the cloud, data center, and enterprise markets, engaged in identifying, designing, and productizing innovative solutions that improve operational efficiency for its customers. The goal is a platform-level solution that enables the adoption of high-density QLC drives, without requiring major host stack changes, so that storage solutions can realize the benefits of high-density drives.

The data points in this paper were collected on sample deployments to illustrate the benefits that can be realized. Platform configurations are variable and can be developed to fit the exact performance, power, and budget needs of customers.

QLC total cost of ownership (TCO) advantage

Incorporating QLC drives in a given deployment requires an upfront effort. Is it worth making this investment? The data speaks for itself. Solidigm used the Solidigm TCO estimator tool to evaluate the TCO of Solidigm QLC SSDs. The tool shows an overall 40% TCO advantage with QLC drives compared to an all-HDD storage array of 10PB capacity for an AI data pipeline solution. This significant cost reduction comes from factors such as space compaction, better power density, and reduced operational expenses, as seen in Figure 1. See the appendix for the TCO configuration and assumptions.

CSAL expedites QLC adoption by reducing the effort the host stack must make to reach a QLC deployment at scale.

Figure 1. Solidigm D5-P5336 vs. HDD TCO

CSAL architecture

Feature overview

The Cloud Storage Acceleration Layer (CSAL) extends the Storage Performance Development Kit (SPDK) to provide software that enables optimized high-density QLC SSD deployment. In addition to write shaping, CSAL delivers a rich set of features that allow customers to build an advanced storage stack (Figure 2) and achieve the performance and TCO that fit the application's needs. Additional features include append-cache and RAID5F. Features in planning include data placement to improve WAF by adopting the NVMe Flexible Data Placement (FDP) standard specification, NVMe streams, and compression and deduplication to reduce user data so that a larger amount of data can be stored on the storage volumes. CSAL can be deployed in RAID or non-RAID configurations.

Write shaping sequentializes user workloads to QLC SSDs, leveraging both QLC's high sequential bandwidth and its dense capacity. It provides scalable performance when running the CSAL FTL on multiple CPU cores, which is required when CSAL works with RAID volumes.

To guarantee robust data redundancy, CSAL can be deployed on top of RAID volumes, including RAID5F. RAID5F is RAID5 that accepts only full-stripe writes. It eliminates the RAID write hole exposure and avoids the read-modify-write otherwise needed when RAID parity must be recalculated and updated. CSAL's append-cache reuses the FTL logic to provide a client-side ultra-fast cache. It maintains the cache device WAF at 1.0 and, thanks to that, delivers high performance and quality of service (QoS).
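As an illustration of how such a stack could be composed with SPDK RPCs, the sketch below (bdev names, PCIe addresses, and strip size are placeholders, and RAID5F availability depends on the SPDK version in use) creates a RAID5F volume from three QLC bdevs and then layers the CSAL FTL on top of it, with a separate fast SSD as the cache device:

# Attach the QLC drives and the cache drive (addresses are examples)
./scripts/rpc.py bdev_nvme_attach_controller -b QLC0 -t pcie -a 0000:65:00.0
./scripts/rpc.py bdev_nvme_attach_controller -b QLC1 -t pcie -a 0000:66:00.0
./scripts/rpc.py bdev_nvme_attach_controller -b QLC2 -t pcie -a 0000:67:00.0
./scripts/rpc.py bdev_nvme_attach_controller -b Cache0 -t pcie -a 0000:68:00.0
# Build a RAID5F volume (64KiB strip size) from the QLC namespaces
./scripts/rpc.py bdev_raid_create -n Raid5f0 -r raid5f -z 64 -b "QLC0n1 QLC1n1 QLC2n1"
# Create the CSAL FTL bdev with the RAID5F volume as base storage
./scripts/rpc.py -t 600 bdev_ftl_create -b Ftl0 -d Raid5f0 -c Cache0n1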

In addition to the above features, users benefit from the robust and mature SPDK infrastructure itself. Users can tap into SPDK features and services to compose full-stack storage solutions, such as leveraging NVMe over Fabrics (RDMA/TCP) and vhost targets, and using various storage device drivers and connectivity protocols (NVMe PCIe drives, NVMe-oF initiators (RDMA/TCP), or io_uring). The configured storage stack can be consumed by applications that use SPDK directly, over a block interface, or over fabrics.

Solidigm has previously delivered proof points with several Solidigm SSD product lines and various CSAL features, such as the Solidigm D7-P5810 used for caching and the Solidigm D5-P5336 for base storage. See the Solidigm Tech Field Day presentation recording for more information: Solidigm Presents the Cloud Storage Acceleration Layer (CSAL) Advantage.

Figure 2. CSAL functional overview with Solidigm cache storage and capacity storage options

Architecture

The CSAL FTL architecture (Figure 3) combines a high-performance storage class memory (or fast NAND) SSD as a write buffer cache with a large-capacity QLC SSD. The CSAL FTL exposes a block device volume to the user while interfacing with the underlying devices through a block API. Applications do not need to change in order to consume the underlying hardware.

Figure 3. CSAL flash translation layer (FTL) architecture with a persistent write buffer

The CSAL FTL intercepts the part of the user workload that is smaller than the indirection unit of the QLC SSD and stages it on the cache tier first. These are generally small user writes, and the lifetime of this data is short. When the cache becomes full, aggregated user data is moved to QLC. This data compaction process generates an IO pattern with the optimal write size (large sequential writes) for the QLC SSD. During compaction, the amount of data written to the QLC SSD is reduced, because short-lifetime data is invalidated on the cache SSD before it is ever written to QLC. Similarly, once the capacity tier becomes full, the CSAL FTL executes algorithms to free up space within the storage layer.

The CSAL FTL uses the cache as a tier for metadata and the logical-to-physical (L2P) table. CSAL implements a two-level L2P mapping to reduce the memory footprint. The mapping provides high-performance, fine-grained (4KiB) data access while maintaining large IO sizes to the QLC storage tier, reducing the impact on application performance and storage endurance. Because the caching tier is persistent, this mechanism also protects data from abrupt power loss events.

Deployments

CSAL FTL has been successfully deployed across thousands of servers, demonstrating its scalability and practical applicability. It is open-source and hardware-agnostic, making it a flexible solution for a variety of data center architectures. In this section, we show different sample deployments for CSAL to give our customers an idea of the options available for their solution needs.

A basic deployment scenario is illustrated in Figure 4. In this setup, the CSAL FTLs and virtual machines (VMs) run on the same server. Each CSAL FTL instance accesses a pair of devices: one for caching and one for capacity. The CSAL FTL volume is exposed to virtual machines using the vhost target. This provides application transparency, allowing workloads to run without having to manage the complexity of the underlying storage backend.

Figure 4. CSAL FTL basic deployment scenario with Solidigm cache storage and capacity storage
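A minimal sketch of this wiring with SPDK RPCs, assuming an FTL bdev named Ftl0 has already been created (as in the step-by-step instructions later in this paper); the socket path and controller name are placeholders:

# Start the vhost target with its socket directory (run in its own terminal)
./build/bin/vhost -S /var/tmp
# Expose the FTL volume to a VM as a vhost-user-blk device
./scripts/rpc.py vhost_create_blk_controller vhost.0 Ftl0

The VM then attaches to the /var/tmp/vhost.0 socket through a vhost-user-blk device in its hypervisor configuration (for example, QEMU's vhost-user-blk-pci), so the guest sees an ordinary block device.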

A more advanced deployment with disaggregated storage is shown in Figure 5. Storage is provided by a dedicated server running the SPDK application, which is configured to operate multiple CSAL FTLs. In this scenario, an SPDK logical volume store is created on top of the FTLs to enable logical volume management. This setup allows multiple logical volumes to be created with custom sizes, supports overprovisioning of the space, and provides the ability to clone or snapshot volumes. The logical volumes are accessed by virtual machines or applications via NVMe-oF.

Figure 5. CSAL FTL disaggregated deployment use case with Solidigm cache storage and capacity storage
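A hedged sketch of that composition using SPDK RPCs follows; the logical volume store name, NQN, IP address, and sizes are placeholders, and it assumes an FTL bdev named Ftl0 already exists:

# Create a logical volume store on top of the FTL bdev, then carve out a volume
./scripts/rpc.py bdev_lvol_create_lvstore Ftl0 lvs0
./scripts/rpc.py bdev_lvol_create -l lvs0 lvol0 102400   # size in MiB in recent SPDK releases
# Export the logical volume over NVMe/TCP
./scripts/rpc.py nvmf_create_transport -t TCP
./scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001
./scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 lvs0/lvol0
./scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 192.168.0.10 -s 4420

Initiators on the compute nodes then discover and connect to the subsystem with standard NVMe/TCP tooling (for example, nvme connect), and the volume appears to the application as a regular NVMe namespace.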

CSAL offers flexibility in deployments with superior scalability and cost efficiency. CSAL architecture effectively utilizes high density QLC SSDs, making it a promising solution for next generation high density scalable storage. For more detailed information on CSAL, visit the Solidigm CSAL page.

Evolution from Intel® Optane™ SSD to Solidigm™ D7-PS1010 Gen5 TLC SSD

Intel's Optane SSD has long been celebrated as a groundbreaking storage innovation, renowned for its exceptional performance and versatility, and it remains one of the best-in-class PCIe Gen4 SSDs. However, the Intel® Optane™ SSD P5800 series has reached the end of its life.3 Fortunately, Solidigm's continued innovation has paved the way for worthy successors: the D7-P5810 Gen4 SLC SSD and the D7-PS1010 Gen5 TLC SSD. In this paper we focus on the Solidigm™ D7-PS1010. This SSD is poised to replace the Optane SSD, particularly for CSAL workloads, delivering remarkable performance improvements at a fraction of the cost.

The CSAL software architecture, which employs write shaping or append caching on the cache SSD, transforms the workload into a highly efficient sequence of writes. This optimization keeps the write amplification factor (WAF) on the cache device close to 1.0, maximizing the bandwidth available to user applications. Our evaluation of the Solidigm D7-PS1010 Gen5 TLC SSD demonstrates up to 90% higher performance than the Optane SSD as a cache drive, at roughly half the price. This result underscores Solidigm's commitment to advancing storage technology, offering a cost-effective, high-performance alternative that meets the demands of modern workloads.

Workload description | CSAL, Optane as cache SSD (baseline) | CSAL, D7-PS1010 as cache SSD
Spatial locality | 1.0x | 1.05x
Random | 1.0x | 1.9x
Internal fragmentation (sequential) | 1.0x | 0.97x
Internal fragmentation (random) | 1.0x | 1.32x
Temporal locality (low) | 1.0x | 1.86x
Temporal locality (high) | 1.0x | 1.46x

Table 1. Relative performance of the D7-PS1010 as an Optane cache SSD replacement (Optane = 1.0x baseline)

Endurance management with CSAL

In this section, we demonstrate how an unoptimized stack using QLC SSDs can impact drive endurance via unaligned IO. We ran the same set of workloads twice: first with a QLC drive attached to the server without any storage software to shape the IO, and second with the platform-level solution built with CSAL deployed on Wiwynn's platform.

With large sequential host write workloads or read workloads, the QLC drive's write amplification is as expected, but as we switch to random workloads, write amplification rises sharply. These random workloads can wreak havoc on the performance and endurance of a QLC SSD. Figure 6 shows that QLC is well suited to inherently sequential workloads, and that CSAL's architecture addresses the write amplification issue discussed in the earlier sections of this paper for the remaining workloads.

Figure 6. Write amplification factor (WAF) comparison: Wiwynn platform with CSAL vs. unoptimized platform (QLC WAF)

Wiwynn Platform

Wiwynn offers 2U ORV3 servers specifically designed for enterprise storage, including models that can accommodate up to twelve (12) U.2 NVMe SSDs. This configuration provides a high-performance all-flash or hybrid high-density storage solution within the Open Rack V3 standard, leveraging the low latency and high throughput of NVMe for demanding applications. These servers are built with Wiwynn's commitment to OCP principles, offering benefits like open designs, power efficiency, and ease of serviceability with front-access drive bays. This makes them ideal for enterprise data centers seeking agile, high-performance storage that aligns with open hardware initiatives and delivers excellent total cost of ownership. See details of the server configuration in the appendix.

Conclusion — Solidigm and Wiwynn collaboration

As an innovative cloud IT infrastructure provider of high-quality computing and storage products and rack solutions for leading data centers, Wiwynn is always exploring ways to bring new technologies to the data center. Focus areas include more efficient thermal solutions, mechanisms to reduce opex and capex, and new technologies for emerging use cases among cloud service provider customers. With Solidigm, Wiwynn has collaborated to explore dense storage solutions that minimize the impact on the software stack while enhancing the utilization and longevity of QLC drives in the data center.


About the Authors

Wiwynn Team

Chandra Nelogal & Ted Pang

Solidigm Team

Jim Lin, Lucas Ho, Wayne Gao, Bo Li, Sarika Mehta, Mariusz Barczak & Yvette Chen

Test steps and configuration

Test System Configuration – Solidigm™ D7-PS1010
System

Manufacturer: Wiwynn

Product Name: Wiwynn ODM Server

BIOS

Vendor: American Megatrends International, LLC.

Version: C3010AM01

Motherboard: AMD Genoa SP5, single socket
PCIe

Option connectors on MB:

Slot#1 Gen 5 x16 (FHFL, front), Slot#2 Gen 4 x 16 (FHHL, front)

Slot#3 Gen 4 x16 (FHHL, internal), Slot#1/#3 FHHL/FHHL coexist

Ethernet: 1pc 100GbE Broadcom OCP NIC card
CPU

Vendor ID:         AuthenticAMD

Model name:        AMD EPYC 9454 48-Core Processor

Thread(s) per core:   2

Core(s) per socket:   48

Socket(s):            1

Stepping:             1

Frequency boost:      enabled

CPU max MHz:          3810.7910

CPU min MHz:          1500.0000

NUMA Nodes: 1
DRAM: 12pcs 64GB DDR5, 1DPC, 4800 MT/s
OS: Ubuntu 22.04.4 LTS
Kernel: Linux c3010a 5.15.0-139-generic
Power: 48V DC bus bar power solution with PDB
Chassis Dimensions: 2OU height, 21”; 92.5mm (H) x 537mm (W) x 801.5mm (D)
SCM

DC-SCM v1.0

Support dedicated BMC management port

Rack: ORV3
NVMe Slots

2 x M.2 (boot)

12 x U.2 carriers, configurable

Test Drives

1 x Solidigm D5-P5336 30TB, FW Rev: ACV10340, PCIe Gen4x4

1 x Solidigm D5-P5336 60TB, FW Rev: 5CV10077, PCIe Gen4x4

1 x Solidigm D7-PS1010 3.84TB, FW Rev: G75YG100, PCIe Gen5x4

Table 3. Wiwynn ORV3 configuration 

Icelake System Configuration – Intel® Optane™ P5800
System

Manufacturer: Supermicro

Product Name: SYS-220U-TNR

BIOS

Vendor: American Megatrends International, LLC.

Version: 1.4b

CPU

Intel(R) Xeon(R) Platinum 8368Q CPU @ 2.60GHz

2 sockets @ 2.6GHz, 38 cores per socket

NUMA Nodes: 2
DRAM: 256GB total, DDR4 @ 3200 MHz
OS: Fedora Linux 40 (Server Edition)
Kernel: 6.13.10-100.fc40.x86_64
SSDs: 1 x Intel® Optane™ P5800 800GB, FW Rev: L0310600, PCIe Gen4x4
1 x Solidigm D5-P5336 15.36TB, FW Rev: 5CV10077, PCIe Gen4x4

Table 4. Test System Configuration  

 

Workload | Workload description
8 write jobs: 64K/seq/w/qd128 | Spatial locality
8 write jobs: 64K/rand/w/qd128 | Random
8 write jobs: 4K/seq/w/qd128 | Internal fragmentation (sequential)
8 write jobs: 4K/rand/w/qd128 | Internal fragmentation (random)
8 write jobs: 64K/zipf0.8/w/qd128 | Temporal locality (low)
8 write jobs: 64K/zipf1.2/w/qd128 | Temporal locality (high)

Table 5. Workload Configuration 

Open source CSAL download

To try out the open-source version of CSAL, please follow the instructions below. For additional feature information and support, contact dl_csal@solidigm.com.

Requirements

In order to use CSAL, a cache device (e.g. SLC/TLC SSD) with at least 5GiB capacity is required. The drive must be formatted to 4KiB block size. 

The base device (e.g., QLC) requires at least 20GiB of capacity. The drive must be formatted to 4KiB block size. 

You can determine the NVMe format using nvme-cli:

nvme list
(Sample output omitted.) In the output of nvme list, verify that both drives are formatted with a 4KiB block size.
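If a drive is not already formatted with a 4KiB block size, it can typically be reformatted with nvme-cli. The sketch below is an example only: the LBA format index for 4096-byte blocks differs between drives, and nvme format destroys all data on the namespace.

# List the supported LBA formats and find the one with a 4096-byte data size
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
# Reformat to that LBA format (index 1 is only an example)
nvme format /dev/nvme0n1 --lbaf=1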

Steps

1. Clone fio on your machine and build it.

git clone https://github.com/axboe/fio.git

cd fio

./configure

make -j 

2. Set up your CSAL/SPDK repository and install its dependencies:

git clone https://github.com/spdk/spdk

cd spdk

git submodule update --init

./scripts/pkgdep.sh

3. Configure and build SPDK with the fio plugin, pointing --with-fio at the fio checkout from step 1 (adjust the path to match your system):

./configure --with-fio=/root/fio

make -j

4. Identify the drives' PCIe addresses:

nvme list -v
In the example used throughout these steps, the addresses of the two NVMe devices are 0000:66:00.0 and 0000:65:00.0.

5. Switch drivers to user space driver (if you're doing this the first time after boot, this also assigns hugepages):

./scripts/setup.sh

a. The default amount of hugepage memory assigned (2GiB) may be too small for a full configuration of FTL (i.e., utilizing the whole NVMe devices). To increase it, use the HUGEMEM option (value in megabytes):

HUGEMEM=8192 ./scripts/setup.sh

b. If you wish to later switch the NVMe devices back to the kernel driver, use the following command (note that this doesn't unassign the hugepages):

./scripts/setup.sh reset

c. If you wish to unassign hugepages you can use the CLEAR_HUGE option:

CLEAR_HUGE=yes ./scripts/setup.sh

6. Execute the target SPDK application (in this example we're using spdk_tgt, but nvmf_tgt, vhost, and iscsi_tgt work as well):

./build/bin/spdk_tgt

7. In another terminal attach the NVMe drives, using the PCIe addresses reported by setup.sh. Note that any bdev names (like Base0) are purely for identification purposes and can have arbitrary values.

./scripts/rpc.py bdev_nvme_attach_controller -b Base0 -t pcie -a 0000:65:00.0

./scripts/rpc.py bdev_nvme_attach_controller -b Cache0 -t pcie -a 0000:66:00.0

Executing the above commands prints the detected NVMe namespaces, each surfaced as a bdev with an n1 suffix (for example, Base0n1 and Cache0n1).

8. (Optional) Split the NVMe devices into smaller partitions (a smaller cache device leads to faster first-time FTL creation). The minimum cache size is 5GiB and the minimum base device size is 20GiB:

./scripts/rpc.py bdev_split_create -s 5120 Cache0n1 1

./scripts/rpc.py bdev_split_create -s 102400 Base0n1 1

9. Create CSAL FTL bdev

./scripts/rpc.py -t 600 bdev_ftl_create -b Ftl0 -d Base0n1p0 -c Cache0n1p0

a. If you wish to load an FTL instance, rather than create a new one, use the -u option:

./scripts/rpc.py -t 600 bdev_ftl_create -b Ftl0 -d Base0n1p0 -c Cache0n1p0 -u 04239e1f-b665-4351-9f22-1eb484c5db76
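The UUID is assigned when the FTL instance is first created; as a quick check (assuming the Ftl0 name from step 9), it can be read back from the running target before shutdown:

./scripts/rpc.py bdev_get_bdevs -b Ftl0

The returned JSON includes a uuid field for the FTL bdev, which can be passed to -u on subsequent loads.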

10. Save the bdev configuration for future use

./scripts/rpc.py save_subsystem_config -n bdev > fio_config.json
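The saved JSON can later be consumed by fio through the SPDK bdev plugin built in step 3. The invocation below is a sketch; the fio and SPDK paths match the earlier steps but should be adjusted to your system, and spdk_tgt must be stopped first because the plugin starts its own SPDK instance:

LD_PRELOAD=./build/fio/spdk_bdev /root/fio/fio --name=ftl_test --ioengine=spdk_bdev \
  --spdk_json_conf=./fio_config.json --filename=Ftl0 --thread=1 --direct=1 \
  --rw=randwrite --bs=4k --iodepth=128 --time_based --runtime=60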

Sample drive preparation and benchmark configuration

Before testing, the SSD under test is typically subjected to two full disk write operations for pre-conditioning using large sequential writes. The fio configuration file is as follows:

[global]

ioengine=spdk_bdev

direct=1

thread=1

buffered=0

norandommap=1

randrepeat=0

scramble_buffers=1

numjobs=1

size=100%

 

[POR]

bs=128k

rw=write

iodepth=128

After completing the preconditioning, adjust the write mode, queue depth, and block size to the requirements of each test item, run additional rounds of workload-specific preconditioning, and then run the test.

Taking Test 1 (sequential writes, queue depth 128, 64K block size) as an example, the configuration file is as follows:

[global]

ioengine=spdk_bdev

direct=1

thread=1 

buffered=0

norandommap=1

randrepeat=0

scramble_buffers=1 

 

[POR]

bs=64k

numjobs=1

rw=write 

iodepth=128

size=100%
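For reference, the temporal-locality workloads in Table 5 can be approximated in fio by switching the job to random writes with a Zipfian access distribution; the snippet below is a sketch of the relevant job options rather than the exact scripts used, with theta values of 0.8 and 1.2 corresponding to the low and high temporal-locality cases:

[POR]
bs=64k
rw=randwrite
random_distribution=zipf:0.8
numjobs=8
iodepth=128
size=100%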

Solidigm™ D5-P5336 TCO and sustainability calculations

TCO calculations are based on the internal Solidigm TCO estimator tool (https://estimator.solidigm.com/ssdtco/index.h), which supports Solidigm D5-P5336 TCO calculations after product launch.
 
SOLIDIGM D5-P5336 VALUE VS. ALL-HDD ARRAY

All-QLC configuration:  
Capacity - Solidigm D5-P5336, 61.44TB, U.2, 7000 MB/s throughput, 25W average active write power, 5W idle power, 95% capacity utilization, RAID 1 mirroring, calculated duty cycle 8.9%. Solidigm D5-P5336 61.44TB Standard Distribution Cost used for calculations. Consult your Solidigm sales representative for latest pricing. 

All-HDD configuration: 

Capacity – Seagate EXOS X20 20TB SAS HDD ST18000NM007D (datasheet), 9.8W average active power, 5.8W idle power, 70% short-stroked throughput calculated to 500 MB/s; Hadoop triplication, 20% duty cycle. Price based on ServerSupply 20TB pricing as of July 10, 2023, $0.0179/GB: https://www.serversupply.com/HARD%20DRIVES/SAS-12GBPS/20TB-7200RPM/SEAGATE/ST20000NM002D_352787.htm?gclid=CjwKCAjw2K6lBhBXEiwA5RjtCS8vuehEnT0SeCvDB95Y0l7X-Ho2VUmMYoVZ600X0sRdGtaieugddBoCv9QQAvD_BwE

Key common cost component assumptions (both configurations): Power cost = $0.15/KWHr; PUE factor = 1.60; Empty rack purchase cost = $1,200; System cost = $10,000; Rack cost for deployment term = $171,200