Next-Gen Hadoop Storage: BlueField-3 and CSAL for Performance and Resilience

Abstract: Traditional Hadoop storage faces challenges in performance and resource efficiency for modern big data workloads. This poster introduces a solution powered by BlueField-3 and Solidigm CSAL software, delivering superior throughput, reduced CPU overhead, and robust data protection via three-way replication, erasure coding (EC), and RAID, ensuring scalable and secure data management.

For the full version of the poster, visit here

Section 1

Hadoop storage architecture with BlueField-3 and Solidigm CSAL

Figure 1. Traditional Hadoop storage architecture vs. BlueField-3 + CSAL Hadoop architecture

By adopting the BlueField-3 + CSAL solution, we restructured the three-layer Hadoop storage architecture, bringing the following advantages:

  1. Leveraging the hardware computing resources in BlueField-3, Solidigm's CSAL software supports multiple data protection mechanisms, including EC, RAID, and multi-copy replication.
  2. The original architecture of three server layers + two switch layers has been compressed into two server layers + one switch layer, significantly reducing the read/write latency of stored data.
  3. Offloading storage-related computation to the BlueField-3 frees CPU resources, allowing the CPU to focus on Hadoop computations.
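The data protection options in point 1 trade capacity overhead for resilience in different ways. The sketch below compares their raw-capacity cost; the specific EC and RAID geometries are hypothetical examples, not the parameters of the poster's deployment:

```python
# Illustrative raw-capacity overhead of the protection schemes CSAL can
# provide: three-way replication, erasure coding (EC), and RAID.
# The EC and RAID geometries are example values for illustration only.

def capacity_overhead(data_units: int, protection_units: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_units + protection_units) / data_units

schemes = {
    "3-way replication": capacity_overhead(1, 2),  # 3.00x raw capacity
    "EC RS(6,3)":        capacity_overhead(6, 3),  # 1.50x raw capacity
    "RAID5 (4+1)":       capacity_overhead(4, 1),  # 1.25x raw capacity
}

for name, factor in schemes.items():
    print(f"{name:>18}: {factor:.2f}x raw capacity per byte stored")
```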

Section 2

Traditional Three-Replica

Figure 2. Traditional three-replica storage architecture

High Reliability: Ensures fault tolerance with three copies.

Chained Replication: Data flows sequentially through Datanodes, increasing latency.

ACK Overhead: Multi-step acknowledgment adds complexity and latency.

BlueField-3 + CSAL Active/Backup Architecture

Figure 3. BlueField-3 + CSAL Active/Backup Architecture

Simplified Workflow: Data is replicated directly inside a single BlueField-3.

Improved Performance: No east-west traffic, fewer replication steps, and lower latency.

Optimized for High-Performance Scenarios: Ideal for latency-sensitive workloads.

An additional management plane mechanism is required to enable rapid failover to the backup node in the event of a Datanode failure.
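A rough latency model makes the contrast between the two replication paths concrete. All latency constants below are hypothetical placeholders for illustration, not measured values from the poster:

```python
# Back-of-the-envelope latency model: HDFS chained (pipeline) replication
# vs. DPU-local replication inside one BlueField-3.
# NET_HOP_US and SSD_WRITE_US are assumed placeholder values.

NET_HOP_US = 50    # assumed one-way network hop latency (us)
SSD_WRITE_US = 20  # assumed SSD write latency (us)

def chained_replication_latency(replicas: int) -> int:
    """Client -> DN1 -> DN2 -> DN3 pipeline; ACKs travel back the chain."""
    forward = replicas * NET_HOP_US + replicas * SSD_WRITE_US
    acks = replicas * NET_HOP_US
    return forward + acks

def dpu_local_replication_latency(replicas: int) -> int:
    """One hop to the BlueField-3, which fans writes out to its local
    drives in parallel, then a single ACK back to the client."""
    return NET_HOP_US + SSD_WRITE_US + NET_HOP_US

print(chained_replication_latency(3))    # 360 us under these assumptions
print(dpu_local_replication_latency(3))  # 120 us under these assumptions
```

Under these assumed numbers, eliminating the east-west replication chain cuts write latency by roughly two thirds; the exact ratio depends on real network and drive latencies.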

Section 3

Ecosystem

Figure 4. Ecosystem for training and inference models: data users, the data lake engine, and data lake storage, with the training framework consuming the training dataset and the inference model serving new data

Section 4

Performance and CPU Comparison

  • ~3x throughput improvement
  • Only ~20% of the CPU resources used
Figure 5. Throughput and CPU comparison, LVM vs. BlueField-3 + CSAL, for sequential 4K write and random 4K read
                      LVM (MiB/s)   BlueField-3 + CSAL (MiB/s)
Sequential 4K write   2333          7697
Random 4K read        6863          18712

Table 1. Throughput comparison, LVM vs. BlueField-3 + CSAL

 

                      LVM (cores)   BlueField-3 + CSAL (cores)
Sequential 4K write   6             1
Random 4K read        5             1

Table 2. CPU usage, LVM vs. BlueField-3 + CSAL
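The headline numbers follow directly from Tables 1 and 2; the quick check below derives the throughput and CPU ratios from the tabulated values:

```python
# Derive the speedup and CPU-saving ratios from Tables 1 and 2.

throughput = {  # MiB/s, (LVM, BlueField-3 + CSAL), from Table 1
    "Sequential 4K write": (2333, 7697),
    "Random 4K read":      (6863, 18712),
}
cpu_cores = {   # cores, (LVM, BlueField-3 + CSAL), from Table 2
    "Sequential 4K write": (6, 1),
    "Random 4K read":      (5, 1),
}

for workload, (lvm, csal) in throughput.items():
    print(f"{workload}: {csal / lvm:.1f}x throughput")   # 3.3x and 2.7x

for workload, (lvm, csal) in cpu_cores.items():
    print(f"{workload}: {csal / lvm:.0%} of LVM's CPU")  # 17% and 20%
```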

 

Section 5

 

CSAL RAID5F vs MDRAID

Figure 6. CSAL RAID5F vs. MDRAID – IOPS comparison (KIOPS)
Figure 7. CSAL RAID5F vs. MDRAID – throughput comparison (GiB/s)

CSAL offers a feature-rich, robust, QLC-friendly RAID solution, unlocking high-density, high-performance deployments:

1. No RMW overhead

2. Built-in write hole protection

3. Scales across multiple cores

4. 4x to 20x better performance vs MDRAID with journal

5. Improved SSD endurance
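Point 1 is the key to both the performance and the endurance claims. A classic RAID5 partial-stripe write costs four device I/Os (read-modify-write), while a log/append-style full-stripe write amortizes parity across the whole stripe. A simplified model, with a hypothetical 4+1 layout for illustration:

```python
# Why avoiding read-modify-write (RMW) matters for RAID5-class arrays.
# The 4+1 layout below is an illustrative example, not CSAL's actual
# RAID5F geometry.

def raid5_rmw_ios(small_writes: int) -> int:
    """Each partial-stripe write: read old data + read old parity,
    then write new data + write new parity = 4 device I/Os."""
    return small_writes * 4

def full_stripe_ios(small_writes: int, data_disks: int) -> float:
    """Writes buffered into full stripes: per stripe of `data_disks`
    small writes, issue `data_disks` data writes + 1 parity write."""
    stripes = small_writes / data_disks
    return stripes * (data_disks + 1)

n = 4000
print(raid5_rmw_ios(n))       # 16000 device I/Os (4.0x amplification)
print(full_stripe_ios(n, 4))  # 5000 device I/Os (1.25x amplification)
```

The lower write amplification is also why this approach improves SSD endurance, which matters most for QLC media with limited program/erase cycles.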

Section 6

TCO analysis

Solution A: BlueField-3 + CSAL vs Solution B: LVM Datanode

Figure 8. Solution A (BlueField-3 + CSAL) vs. Solution B (LVM Datanode): CAPEX and TCO comparison

Solution A (BlueField-3 + CSAL) saves data center space and power compared to Solution B by reusing compute-node servers and offloading storage-related computation to the BlueField-3.

Taking the three-node example in the figure, Solution A saves over 50% in CAPEX-driven TCO.
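The server-count arithmetic behind that saving can be sketched as follows; all quantities are placeholders matching the three-node example, not the poster's actual bill of materials:

```python
# Illustrative server-count model for the TCO comparison: Solution B needs
# a separate storage tier, while Solution A folds Datanode storage onto
# BlueField-3-equipped compute nodes. Node counts are from the three-node
# example; costs per server are deliberately omitted.

def solution_b_servers(compute_nodes: int, storage_nodes: int) -> int:
    """LVM Datanode: separate compute and storage server tiers."""
    return compute_nodes + storage_nodes

def solution_a_servers(compute_nodes: int) -> int:
    """BlueField-3 + CSAL: storage offloaded onto the compute nodes' DPUs."""
    return compute_nodes

compute, storage = 3, 3
a = solution_a_servers(compute)
b = solution_b_servers(compute, storage)
print(f"Solution A uses {a} servers vs {b} for Solution B "
      f"({1 - a / b:.0%} fewer)")  # 50% fewer servers
```

Halving the server count roughly halves server CAPEX, rack space, and power, which is consistent with the over-50% saving the figure reports.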

Read about DPU-native flexible storage architecture here


About the Authors

Wayne Gao is a Principal Engineer and Solution Storage Architect at Solidigm. He has worked on Solidigm’s Cloud Storage Acceleration Layer (CSAL) from pathfinding to commercial release. Wayne has over 20 years of storage developer experience, has four U.S. patent filings/grants, and is a published EuroSys paper author.

Bo Li serves as a senior storage solutions architect at Solidigm. With over two decades of experience in system design and development across multiple organizations, he specializes in optimizing the performance of networked and storage solutions. In recent years, Bo has concentrated his efforts on advancing the industry-wide adoption of non-volatile storage technologies.

Mariusz Barczak is a Principal Engineer at Solidigm. He has over 13 years of experience finding innovations in storage software and storage solutions.  His particular expertise is caching solutions, software defined storage, virtualization, and storage analytics. Mariusz holds numerous patents and is active in the open-source community. He is currently focused on leading the Solidigm team for Cloud Storage Acceleration Layer (CSAL) which delivers mixed media solutions combining Solidigm SLC SSDs with other storage components, such as Solidigm QLC SSDs, to deliver efficient and durable storage.