Storage markets are shifting their focus from raw performance or capacity to performance per watt and capacity per watt. This shift arrived almost overnight with rapidly expanding AI deployments. Today’s global power infrastructure is stretched to its limits, and companies are actively seeking additional power sources, including nuclear power, to keep the AI revolution going and to power both existing and new data centers. Against this backdrop, QLC technology, with its ability to deliver higher storage capacity per drive and per watt, is gaining significant momentum, and companies are building infrastructure to consume ever-larger storage capacities. As drives grow larger, however, new concerns must be addressed, in particular reliability and recovery time in case of failure. These are valid concerns, and addressing them is a priority for Solidigm and Xinnor.
In our previous work, Optimal Raid Solution with Xinnor xiRAID and High Density Solidigm QLC Drives, we showcased RAID5 performance using high-density 61.44TB QLC SSDs, the Solidigm™ D5-P5336, with Xinnor xiRAID, a software-based RAID solution specifically designed to handle the high level of parallelism of NVMe drives. By combining software RAID reliability with QLC capacity, we demonstrated a solution for any deployment that needs massive capacity without compromising reliability. The performance and write amplification factor (WAF) findings for the RAID5 configuration are published in that solution brief.
In this document, we address concerns about rebuild time for high-capacity QLC drives by measuring how long it takes to rebuild a drive within the RAID array. We also examine the write amplification during the rebuild process.
Write amplification factor (WAF) is a well-known measure of NAND SSD endurance and, in our opinion, also a measure of storage stack efficiency. WAF is the ratio of NAND writes to host writes for a given application workload; host writes and NAND writes are captured from the SSD SMART log before and after the workload to compute it. The following formula is used to calculate WAF:
WAF = NAND writes for the workload / Host writes for the workload
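As a minimal illustration, the short Python sketch below applies this formula to counters captured before and after a workload. The counter values are placeholders; in practice they come from the drive’s SMART log (the nand_bytes_written and host_bytes_written attributes used later in this document).

# WAF = NAND writes during the workload / host writes during the workload.
# Counter values below are placeholders; real values come from the SSD SMART log.
def waf(nand_before, nand_after, host_before, host_after):
    nand_delta = nand_after - nand_before
    host_delta = host_after - host_before
    if host_delta == 0:
        raise ValueError("no host writes recorded for this workload")
    return nand_delta / host_delta

# Example: the drive wrote 2% more bytes to NAND than the host submitted, so WAF = 1.02.
print(waf(nand_before=1_000_000, nand_after=1_510_000,
          host_before=2_000_000, host_after=2_500_000))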
In an ideal situation, one host write would result in exactly one NAND write, leading to a WAF of 1. However, many factors can cause multiple NAND writes for each host write. QLC drives use a larger indirection unit, which can range from 8K to 64K and increases with drive capacity.
There are two primary reasons for this: cost and power. First, keeping the on-SSD DRAM size equal to that of lower-capacity drives gives QLC drives a cost advantage. Second, it allows high-density QLC drives to fit the same power budget as lower-capacity drives. These factors, combined with read performance that saturates a Gen4 link and write performance that is orders of magnitude better than HDDs, give customers the ability to upgrade their storage infrastructure to meet their AI ambitions while still gaining a total cost of ownership (TCO) advantage.
Zooming in on the NAND media: when the host writes an IO whose size matches the indirection unit size and whose offset is aligned to the indirection unit, the IO is written directly to the physical NAND media and no additional writes are incurred. However, when either 1) the IO size differs from the indirection unit size, or 2) the IO is unaligned, NAND physics require the affected unit to be read first, modified with the portion being written, and then written back to the media. This read-modify-write operation is the main cause of higher write amplification. Host stacks are generally optimized around 4KB IO sizes, and writing these smaller IOs to a large indirection unit can lead to high write amplification. Read-modify-writes also consume performance cycles that could otherwise be used to service the host application workload.
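To make the effect concrete, here is a simplified, hypothetical model (an assumption for illustration, not actual Solidigm firmware behavior) that counts the NAND bytes touched when a host write is rounded out to indirection-unit boundaries. It shows why 4KB host writes against a 16K or 64K indirection unit inflate WAF.

# Simplified illustrative model (assumption, not real firmware behavior): every host
# write rewrites all of the indirection units it touches, so IOs smaller than the
# indirection unit, or unaligned to it, trigger a read-modify-write of whole units.
def nand_bytes_touched(offset, length, iu):
    first_unit = offset // iu
    last_unit = (offset + length - 1) // iu
    return (last_unit - first_unit + 1) * iu

host_bytes = 4096                       # a single aligned 4KB host write
for iu in (4096, 16384, 65536):
    nand_bytes = nand_bytes_touched(offset=0, length=host_bytes, iu=iu)
    print(f"IU={iu // 1024}K: worst-case WAF ~ {nand_bytes / host_bytes:.0f}")
# IU=4K -> ~1, IU=16K -> ~4, IU=64K -> ~16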
For a more detailed exploration of this topic, please refer to the above-referenced solution brief. To summarize its findings: we showed how to optimize application workloads, and how xiRAID’s architecture, with its flexibility to vary chunk size, incurred minimal read-modify-write cycles on the QLC drives and, in return, maximized the performance potential of the RAID array.
A parity RAID is a RAID array that uses parity as its protection mechanism to provide fault tolerance and data redundancy. RAID5 is an example of a parity RAID with one parity drive, while RAID6 has two parity drives. Other parity RAID configurations exist, but RAID5 and RAID6 are the most commonly used. Parity calculations for parity RAIDs are based on the mathematical formulas of RAID syndrome calculation and can be time consuming, depending on how efficiently the RAID engine implements them. A parity RAID rebuild is the process by which the RAID engine recalculates the data for a new drive that has been inserted into the array to replace a failed drive. During the rebuild, for each RAID stripe, the RAID engine reads data from the remaining healthy drives, calculates the missing value, and writes it to the new drive.
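The sketch below illustrates the recovery math for a single RAID5 stripe: because the parity chunk is the XOR of the data chunks, the chunk that lived on the failed drive can be recomputed by XORing the chunks read from the surviving drives. The chunk size and contents are placeholders, and a real RAID engine uses optimized syndrome calculations across every stripe of the drive rather than this pure-Python XOR.

# Minimal RAID5 single-stripe illustration: parity = XOR of the data chunks, so the
# lost chunk equals the XOR of all surviving chunks (data and parity).
from functools import reduce

def xor_chunks(chunks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

CHUNK = 16                                         # placeholder chunk size in bytes
data = [bytes([i] * CHUNK) for i in range(1, 10)]  # 9 data chunks, as in the test arrays
parity = xor_chunks(data)                          # parity chunk stored for this stripe

lost = 3                                           # pretend the drive holding chunk 3 failed
survivors = [c for i, c in enumerate(data) if i != lost] + [parity]
rebuilt = xor_chunks(survivors)                    # read survivors, XOR, write to new drive

assert rebuilt == data[lost]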
Operating in degraded mode (when the RAID has lost at least one of its drives and the rebuild process has not yet commenced) or in rebuild mode (when at least one drive in the RAID is rebuilding) means reduced performance and an increased risk of data loss should another drive fail.
That’s why the speed of the RAID rebuild is crucial. The faster the rebuild completes, the less time the RAID array must operate with reduced performance and increased risk of failure. Rebuild time scales linearly with the size of the RAID drives: the larger the failed drive, the longer the rebuild takes. As drive capacities continue to increase, RAID engines must be highly efficient and process rebuilds at speeds that match the raw performance of the drives.
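As a quick worked example, the snippet below applies time = capacity / rebuild speed to a few drive capacities. The 1 GB/s rebuild speed is a placeholder chosen purely for illustration; the measured rebuild speeds appear in Tables 1 and 2.

# Rebuild time scales linearly with drive capacity: time = capacity / rebuild_speed.
REBUILD_SPEED_B_S = 1e9                  # placeholder rebuild speed of 1 GB/s
for capacity_tb in (15.36, 30.72, 61.44):
    hours = capacity_tb * 1e12 / REBUILD_SPEED_B_S / 3600
    print(f"{capacity_tb} TB drive: ~{hours:.1f} hours to rebuild")
# ~4.3 h, ~8.5 h, ~17.1 h: doubling the drive size doubles the rebuild time.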
Given these concerns about rebuilding RAID arrays built from highly dense drives, and the associated requirement for highly efficient software, we decided to measure the actual rebuild behavior of Solidigm high-density 61.44TB QLC drives in a RAID5 array with xiRAID. In the following sections, we walk through a series of experiments and their results. We selected mdraid, an open-source software RAID solution, for comparison with our results, and we ran rebuilds with and without host workloads to capture the best-case and worst-case scenarios.
During this test, the following sequence was executed for each RAID engine – mdraid and xiRAID Classic 4.3:
1. Created a RAID5 array of 9 data drives with the default parameters.
2. Set the RAID rebuild parameters as follows:
3. Waited for the RAID initialization to complete.
4. Captured the nand_bytes_written and host_bytes_written parameters of the new data drive from its SMART log; these were used to calculate the drive’s WAF. This drive did not yet belong to the RAID and served as the replacement drive.
5. Replaced a drive in the RAID array with the drive from step #4. Captured the time of replacement.
6. Captured the rebuild time from the system logs after the rebuild completed.
7. Captured the nand_bytes_written and host_bytes_written parameters of the rebuilt data drive again, as described in step #4.
8. Calculated the WAF using the formula given above (see the sketch after this list).
9. Calculated the rebuild time for the rebuilt drive.
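The sketch below shows one way steps 4 through 9 could be scripted. The nvme-cli plugin command used to read the vendor-extended SMART counters (`nvme solidigm smart-log-add` here), the device path, and the output parsing are all assumptions that may need adjusting to the installed nvme-cli version and its output format.

# Hedged sketch of steps 4-9 for one rebuild run; command name, device path, and
# output parsing are assumptions, not a verified recipe.
import re
import subprocess
import time

def read_extended_smart(dev):
    # Assumed plugin command; adjust to the installed nvme-cli version.
    out = subprocess.run(["nvme", "solidigm", "smart-log-add", dev],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for key in ("nand_bytes_written", "host_bytes_written"):
        match = re.search(rf"{key}\D+(\d+)", out)   # naive parse; adapt to real output
        if match:
            counters[key] = int(match.group(1))
    return counters

DEV = "/dev/nvme9n1"                 # placeholder: the replacement drive
before = read_extended_smart(DEV)    # step 4
start = time.time()                  # step 5: drive replaced, rebuild begins
# ... wait here until the RAID engine reports the rebuild as complete ...
end = time.time()                    # step 6
after = read_extended_smart(DEV)     # step 7

waf = (after["nand_bytes_written"] - before["nand_bytes_written"]) / \
      (after["host_bytes_written"] - before["host_bytes_written"])   # step 8
print(f"WAF = {waf:.2f}, rebuild time = {(end - start) / 3600:.1f} h")  # step 9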
The results are presented in Table 1.
RAID Engine | Rebuild time | Rebuild speed (drive_size/rebuild_time) | WAF (drive under rebuild)
mdraid | 53h 40m | 322 MB/s | 1.2
xiRAID Classic 4.3 | 5h 22m | 3.18 GB/s | 1.02
Table 1. RAID rebuild without host workload
In the absence of a host workload, rebuild time is 10x better with xiRAID than with mdraid. While an mdraid rebuild of a 61.44TB QLC drive takes about 2.25 days (~54 hours) even without a workload, xiRAID can rebuild the drive in a matter of hours. xiRAID can leverage the full raw sequential bandwidth of the Solidigm D5-P5336 SSD to deliver fast rebuild times. This type of rebuild can be initiated during expected host workload idle periods.
During this test the following sequence was executed for each RAID engine – mdraid and xiRAID Classic 4.3:
1. Created a RAID5 array of 9 data drives with the default parameters.
2. Set the RAID rebuild parameters as follows:
3. Waited for the RAID initialization to complete.
4. Captured the nand_bytes_written and host_bytes_written parameters of the new data drive from its SMART log; these were used to calculate the drive’s WAF. This drive did not yet belong to the RAID and served as the replacement drive.
5. Started the following fio mixed workload:
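# Mixed sequential workload: 38 x 1MB sequential-write jobs and 90 x 1MB
# sequential-read jobs, each job clone starting at a different offset so the jobs
# stream through different regions of the RAID device.
# filename=/dev/xi_test is the xiRAID block device; for the mdraid runs the job
# file is assumed to target the corresponding /dev/mdX device instead.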
[global]
iodepth=64
direct=1
ioengine=libaio
group_reporting
runtime=604800 # seconds (1 week)
[write]
rw=write
bs=1MB
numjobs=38
offset_increment=15G
filename=/dev/xi_test
[read]
rw=read
bs=1MB
numjobs=90
offset_increment=10G
filename=/dev/xi_test
6. Replaced a drive in the RAID array with the drive from step #4. Captured the time of replacement.
7. Captured the rebuild time from the system logs after the rebuild completed.
8. Captured the nand_bytes_written and host_bytes_written parameters of the rebuilt data drive again, as described in step #4.
9. Calculated the WAF using the formula given above.
10. Calculated the rebuild time for the rebuilt drive.
The mdraid rebuild speed under the host workload (approximately 10.5MB/s) was too slow to complete a full rebuild of a 61.44TB drive in a reasonable amount of time, so the mdraid rebuild time was extrapolated from the observed rebuild speed.
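The extrapolation itself is simple division; the quick check below reproduces the "> 67 days" figure in Table 2 from the observed rebuild speed.

# Extrapolated mdraid rebuild time under host workload from the observed rebuild speed.
capacity_bytes = 61.44e12            # 61.44TB drive
speed_bytes_s = 10.5e6               # ~10.5 MB/s observed rebuild speed
days = capacity_bytes / speed_bytes_s / 86400
print(f"~{days:.0f} days")           # ~68 days, i.e. more than 67 days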
The results are presented in Table 2.
RAID Engine | Rebuild time | Rebuild speed (drive_size/rebuild_time) | WAF (drive under rebuild) | Workload speed (under rebuild)
mdraid | > 67 days | 10.5 MB/s | 1.58 | Read ~100 MB/s, Write ~45 MB/s
xiRAID Classic 4.3 | 53h 53m | 316 MB/s | 1.21 | Read ~44 GB/s, Write ~13 GB/s
Table 2. RAID rebuild with host workload
When host IO cannot wait for the drive rebuild to complete, mdraid is rendered unusable with large-capacity drives: a rebuild can take more than two months, leaving the RAID array vulnerable to potential data loss from a second drive failure. xiRAID, on the other hand, delivers the same rebuild performance with the host workload running as mdraid delivers with no workload at all. The Solidigm D5-P5336 61.44TB QLC drives in the RAID array can handle the mixed IO generated by the host workload alongside the rebuild IO from the software stack.
Our results help debunk the myth that rebuilding high-density drives in a RAID array takes so long that such drives are a poor option for RAID solutions. While this concern is valid for RAID engines that are not optimized for NVMe drives, as in the case of mdraid, our experiments demonstrate that xiRAID Classic 4.3 can rebuild a Solidigm 61.44TB QLC drive in just a few hours without a host workload, and in slightly more than two days with a concurrent host workload, while keeping WAF close to 1. This also demonstrates that Solidigm high-density QLC SSDs can easily withstand the demands of a high-performance software stack and can keep up with mixed read and write host workloads while servicing rebuild IO. Without the raw performance capability of the Solidigm D5-P5336 QLC SSDs, rebuilds could take a long time, months in some cases, and/or host IO would not be serviced properly.
These results demonstrate that xiRAID Classic 4.3 is a much better fit for RAID deployments of Solidigm high-density QLC drives, keeping the RAID accessible to the host even during active, fast rebuilds.
Sarika Mehta is a Senior Storage Solutions Architect at Solidigm, bringing over 16 years of experience from her tenure at Intel’s storage division and Solidigm. Her focus is to work closely with Solidigm customers and partners to optimize their storage solutions for cost and performance. She is responsible for tuning and optimizing Solidigm’s SSDs for various storage use cases in a variety of storage deployments, ranging from direct-attached storage to tiered and non-tiered disaggregated storage solutions. She has a diverse storage background spanning validation, performance benchmarking, pathfinding, technical marketing, and solutions architecture.
Daniel Landau is a Senior Solutions Architect with Xinnor. Daniel has over a decade of experience as a system architect, resolving complex network configurations and system installations. Outside of work, Daniel enjoys travel, photography, and music.
Test System Configuration
System | Dell PowerEdge R760
BIOS | Vendor: Dell Inc., Version: 2.3.5
CPU | 2 x Intel(R) Xeon(R) Gold 6430, 2 sockets @ 3.4GHz, 32 cores per socket
NUMA Nodes | 2
DRAM | Total 512GB DDR4 @ 3200 MHz
OS | Rocky Linux 9.5
Kernel | 5.14.0-503.22.1.el9_5.x86_64
SSD | 10 x Solidigm D5-P5336 61.44TB, FW Rev: 5CV10302, PCIe Gen4 x4
fio | Version: 3.35
xiRAID | Version: 4.3
mdraid (mdadm) | Version: 4.3 (2024-02-15)
Table 3. System configuration