Explanation of the 512-H100 Cluster Networking Solution

Published December 5, 2024

A total of 256 servers represents the limit of a two-tier Spine-Leaf architecture; exceeding 256 servers requires a three-tier architecture, specifically Core-Spine-Leaf. Therefore, the 512-server H100 network configuration introduced here is designed as a three-tier IB network.

In the three-tier architecture, the Core layer serves as the core switching layer, responsible for high-speed data forwarding and aggregation; the Spine layer acts as the backbone layer, providing high-speed connectivity and forwarding; and the Leaf layer functions as the access layer, connecting servers to the network.

Given the exceptionally high data transmission requirements for large-scale model training, the compute network is designed with a global no-blocking architecture and utilizes a 400 Gb/s IB network (NDR), while the storage network employs a 200 Gb/s IB network (HDR). The complete network architecture diagram for the 512-node cluster is shown below.

[Figure: Overall network architecture of the 512-node cluster]

I. Computing Network

The 512 H100 servers are divided into 4 SuperPods, with each SuperPod containing 4 SU units, and each SU unit containing 32 H100 servers. This means each SuperPod has 128 servers.

Every 4 Leaf switches and 4 Spine switches form a Rail Group. Each SuperPod comprises 8 Rail Groups, for a total of 32 Leaf switches and 32 Spine switches, plus 16 Core switches attributed to the Core layer. Thus, each SuperPod accounts for 32 + 32 + 16 = 80 IB switches, and the 4 SuperPods together require 80 × 4 = 320 IB switches.
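The switch accounting above can be verified with a quick arithmetic sketch (all figures are taken from the text, including the article's convention of attributing 16 Core switches to each SuperPod):

```python
# Switch-count check for the compute fabric (figures from the text).
SERVERS_TOTAL = 512
SUPERPODS = 4
SUS_PER_SUPERPOD = 4
SERVERS_PER_SU = 32

servers_per_superpod = SUS_PER_SUPERPOD * SERVERS_PER_SU  # 4 x 32 = 128
assert SUPERPODS * servers_per_superpod == SERVERS_TOTAL

# Each Rail Group pairs 4 Leaf switches with 4 Spine switches;
# a SuperPod comprises 8 Rail Groups.
RAIL_GROUPS_PER_SUPERPOD = 8
leafs_per_superpod = RAIL_GROUPS_PER_SUPERPOD * 4   # 32
spines_per_superpod = RAIL_GROUPS_PER_SUPERPOD * 4  # 32
cores_per_superpod = 16  # Core-layer share attributed to each SuperPod

switches_per_superpod = leafs_per_superpod + spines_per_superpod + cores_per_superpod
switches_total = switches_per_superpod * SUPERPODS
print(switches_per_superpod, switches_total)  # 80 320
```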

Each H100 server is configured with eight 400 Gb/s network interface cards (NICs) and uses a multi-rail networking architecture: each server's eight NICs are connected to 8 different Leaf switches.
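The wiring rule can be sketched as follows. The leaf-switch naming scheme here is hypothetical; the invariant it illustrates is simply that NIC i of every server in an SU lands on the same Leaf switch, forming rail i:

```python
# Multi-rail wiring sketch: NIC i of every server in an SU connects to Leaf i.
NICS_PER_SERVER = 8  # eight 400 Gb/s NICs per H100 server

def leaf_for_nic(su_id: int, nic_index: int) -> str:
    """Return a (hypothetical) leaf-switch label for a given server NIC."""
    assert 0 <= nic_index < NICS_PER_SERVER
    return f"SU{su_id}-Leaf{nic_index}"

# Every server in SU 0 spreads its 8 NICs across 8 distinct leaves:
rails = {leaf_for_nic(0, nic) for nic in range(NICS_PER_SERVER)}
print(len(rails))  # 8
```

The payoff of this layout is that same-rail traffic (e.g. all-reduce between GPU i on different servers) stays within one Leaf switch hop.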

The three-tier Core-Spine-Leaf network topology diagram is as follows:

[Figure: Three-tier Core-Spine-Leaf network topology]

II. Storage Network

The storage system for the 512-node cluster is divided into two parts: high-performance storage and mass storage. High-performance storage uses all-flash drives, configured at a ratio of 1 TB per GPU; with 8 GPUs per server, the 512 H100 servers therefore require 4 PB of high-performance storage. Mass storage is configured at 4–5 times the capacity of high-performance storage, with a planned usable capacity of 20 PB.
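The capacity figures follow directly from the sizing rules in the text:

```python
# Storage sizing from the text: 1 TB of all-flash per GPU,
# mass tier planned at the 5x end of the 4-5x range.
SERVERS = 512
GPUS_PER_SERVER = 8
TB_PER_GPU = 1

high_perf_tb = SERVERS * GPUS_PER_SERVER * TB_PER_GPU  # 4096 TB = 4 PB
mass_tb = 5 * high_perf_tb                             # 20480 TB = 20 PB
print(high_perf_tb // 1024, mass_tb // 1024)  # 4 20 (PB, binary units)
```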

[Figure: Storage configuration of the 512-node cluster]

The storage network utilizes a 200 Gb/s InfiniBand (HDR) fabric; each H100 server is equipped with one 200 Gb/s IB NIC as its storage access port. The network adopts a two-tier Spine-Leaf architecture with a global 1:1 convergence ratio, requiring 37 Leaf switches and 20 Spine switches. QM8700-class IB switches are sufficient for this configuration.
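A QM8700 provides 40 HDR 200 Gb/s ports, so at a 1:1 convergence ratio each Leaf splits them evenly: 20 downlinks and 20 uplinks. A quick port-budget sketch for the server side alone (the article's total of 37 Leaf switches also attaches the storage servers themselves, which are not enumerated in the text):

```python
# Leaf port budget at a 1:1 convergence ratio on 40-port QM8700 HDR switches.
import math

PORTS_PER_SWITCH = 40            # QM8700: 40x HDR 200 Gb/s ports
down_per_leaf = PORTS_PER_SWITCH // 2  # 1:1 ratio: 20 down, 20 up

server_ports = 512               # one HDR storage NIC per H100 server
leafs_for_servers = math.ceil(server_ports / down_per_leaf)
print(leafs_for_servers)  # 26; the remaining leaves attach storage nodes
```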

[Figure: Storage network two-tier Spine-Leaf topology]
