Introduction
The case for building a 100,000-card cluster is straightforward: today, the entry ticket to the top tier of AI companies is already a 32,000-card cluster, and by 2025 that threshold is projected to rise to a 100,000-card (H100, H200) cluster, leaving enormous room for growth.
An AI cluster comprising 100,000 H100 cards consumes up to 150 MW of power and requires an investment exceeding $4 billion (approximately 30 billion RMB). Annual energy consumption amounts to approximately 1.59 × 10^9 kWh; at a rate of $0.078 per kWh, annual electricity costs reach $124 million. Figures this large make energy consumption and cost-effectiveness worth examining in depth.
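As a rough sanity check, the arithmetic can be written out directly. The quantities below are taken from the text; the implied average draw coming out above 150 MW suggests (my assumption) that the annual kWh figure includes facility overhead beyond the IT load.

```python
# Back-of-the-envelope check of the energy and cost figures above.
HOURS_PER_YEAR = 24 * 365

annual_kwh = 1.59e9          # annual consumption cited above
price_per_kwh = 0.078        # USD per kWh, cited above

implied_avg_mw = annual_kwh / HOURS_PER_YEAR / 1000   # ~182 MW average draw
annual_cost_usd = annual_kwh * price_per_kwh          # ~$124 million

print(f"Implied average draw: {implied_avg_mw:.0f} MW")
print(f"Annual electricity cost: ${annual_cost_usd / 1e6:.0f}M")
```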
Challenges
(1) Energy and Space Challenges
Behind the computing power bottleneck lie two major hurdles: "energy" and "engineering capabilities."
"A cluster comprising 100,000 H100 GPUs requires a power draw of up to 150 MW, surpassing the 30 MW of the world’s largest supercomputer, El Capitan—which consumes only one-fifth of the power of the former."
Inside an H100 server, each GPU consumes approximately 700 W. On top of that, roughly 575 W per GPU is needed to drive the accompanying CPU, network interface card (NIC), and power supply unit (PSU).
Beyond the H100 servers themselves, an AI cluster also contains many other devices, such as storage servers, network switches, and optical transceivers, which together account for approximately 10% of total power consumption.
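Putting the per-GPU figures together gives a rough power budget for the whole cluster. This is only an illustrative calculation using the numbers above (700 W per GPU, 575 W of host overhead per GPU, ~10% for everything else), not a measured breakdown.

```python
# Illustrative cluster power budget from the per-GPU figures above.
gpus = 100_000
gpu_w = 700            # H100 GPU
host_w_per_gpu = 575   # CPU, NIC, PSU share attributed to each GPU

server_load_mw = gpus * (gpu_w + host_w_per_gpu) / 1e6   # ~127.5 MW
# storage servers, switches, transceivers, etc. ~10% of the total:
total_mw = server_load_mw / 0.9                          # ~142 MW

print(f"Server load: {server_load_mw:.1f} MW, total: {total_mw:.0f} MW")
```

This lands in the same ballpark as the ~150 MW headline figure.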
X.AI converted a former factory in Memphis, Tennessee, into a data center that draws 150 MW of power and consumes 1 million gallons of water per day. Currently, no other data center in the world has the capacity to host a 150 MW AI cluster.
These AI clusters are interconnected via optical links, and the cost of optical communication rises with transmission distance.
The maximum transmission distance for multimode SR and AOC transceivers is approximately 50 meters.
In the world of data centers, each building is called a “computing island.” These islands are filled with multiple “computing pods” connected via cost-effective copper cables or multimode optics, while the islands themselves are linked with long-reach single-mode optical communication. This approach is efficient and provides stable data transmission, meeting modern data centers' requirements for performance and reliability.
Since data parallelism involves relatively low communication volume, it can be distributed across different computing islands:

Currently, within this cluster of over 100,000 nodes, three buildings (three computing islands) have been completed. Each computing island houses approximately 1,000–1,100 server racks, with a total power consumption of about 50 MW.
(2) Network Architecture and Parallelization Strategy
Data Parallelism
This parallelization method requires the least communication, as only gradient data needs to be transferred between GPUs.
However, data parallelism requires each GPU to have sufficient memory to hold the entire model's training state (weights plus optimizer state). For the GPT-4 model with 1.8 trillion parameters, this translates to a per-GPU memory requirement of up to 10.8 TB.
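The 10.8 TB figure is easy to reproduce if one assumes roughly 6 bytes of training state per parameter; that per-parameter figure is my assumption, chosen because it matches the number in the text, and the comparison against H100 HBM capacity shows why a pure data-parallel layout is infeasible.

```python
# Reproducing the 10.8 TB figure under an assumed ~6 bytes of state per parameter.
params = 1.8e12            # GPT-4 parameter count from the text
bytes_per_param = 6        # assumption: weights + optimizer state, mixed precision
hbm_per_gpu_gb = 80        # H100 HBM capacity

state_tb = params * bytes_per_param / 1e12          # 10.8 TB
min_gpus_to_hold = state_tb * 1000 / hbm_per_gpu_gb # ~135 GPUs just to store it

print(f"{state_tb:.1f} TB of state -> needs at least {min_gpus_to_hold:.0f} GPUs' worth of HBM")
```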
Tensor Parallelism
Tensor parallelism was developed to overcome the memory limitations of data parallelism.
In tensor parallelism, each layer's computation is split across several GPUs, which must frequently exchange intermediate results. Tensor parallelism therefore requires high-bandwidth, low-latency network connections.
Tensor parallelism significantly reduces per-GPU memory requirements. For example, with 8-way tensor parallelism over NVLink, the memory used per GPU drops by roughly a factor of 8.
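A minimal sketch of how this works, using NumPy arrays as stand-ins for GPUs: the layer's weight matrix is split column-wise across eight devices (Megatron-style column parallelism), so each device stores and multiplies only one eighth of it. The dimensions and the NumPy setting are illustrative assumptions, not the actual training configuration.

```python
import numpy as np

tp_degree = 8
d_in, d_out = 1024, 4096
rng = np.random.default_rng(0)

x = rng.standard_normal((2, d_in))       # activations, replicated on every device
W = rng.standard_normal((d_in, d_out))   # the full layer weight

shards = np.split(W, tp_degree, axis=1)  # each "GPU" keeps 1/8 of the columns
partials = [x @ w for w in shards]       # computed independently per device
y = np.concatenate(partials, axis=1)     # gathered over NVLink in a real system

assert np.allclose(y, x @ W)
print("per-device weight fraction:", shards[0].size / W.size)   # 0.125
```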
Pipeline Parallelism
Another method for overcoming GPU memory limitations is pipeline parallelism.
Pipeline parallelism is a technique for achieving model parallelism in distributed computing environments, primarily used in the field of deep learning, particularly when handling large-scale neural network models. By distributing different parts of the model (such as layers of a neural network) across different compute nodes, pipeline parallelism enables multiple machines in a cluster to collaborate on model training without sacrificing training efficiency.
Once a GPU finishes the forward (or backward) computation for its stage, it passes the intermediate results to the next GPU and can immediately begin processing the next micro-batch of data, which improves utilization and shortens training time. This does introduce inter-GPU communication overhead, since each GPU must hand off data to the next after completing its computation, and it therefore requires efficient network connectivity to keep those transfers fast.
Pipeline parallelism places high demands on communication, though not as high as tensor parallelism.
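The scheduling idea can be seen in a toy wavefront simulation: stage i can work on micro-batch b as soon as stage i-1 has finished it, so after a short warm-up all stages stay busy. Stage and micro-batch counts here are made up for illustration.

```python
# Toy pipeline-parallel schedule (forward passes only, GPipe-style).
num_stages = 4          # e.g. 4 GPUs, each holding a contiguous slice of layers
num_microbatches = 8

timeline = {}           # time step -> list of (stage, micro-batch) running then
for mb in range(num_microbatches):
    for stage in range(num_stages):
        t = mb + stage  # a stage can only start once the previous stage is done with mb
        timeline.setdefault(t, []).append((stage, mb))

for t in sorted(timeline):
    busy = ", ".join(f"stage{s}:mb{m}" for s, m in timeline[t])
    print(f"step {t:2d}  {busy}")
```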
3D Parallelism
This approach utilizes GPU tensor parallelism within the H100 Server, pipeline parallelism among nodes within a compute island, and data parallelism across compute islands to improve efficiency.
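One way to picture 3D parallelism is to multiply the three degrees out against the physical hierarchy. The specific degrees below are assumptions chosen to land near 100,000 GPUs, not the cluster's real configuration.

```python
# Illustrative 3D-parallel decomposition mapped onto the hardware hierarchy.
tensor_parallel = 8      # inside one H100 server, over NVLink
pipeline_parallel = 16   # across servers within one computing island
data_parallel = 768      # replicas spread across computing islands

print(tensor_parallel * pipeline_parallel * data_parallel)   # 98,304 GPUs

# Communication needs fall off as you move outward: NVLink for tensor
# parallelism, the island's backend network for pipeline parallelism, and the
# lower-bandwidth island-to-island links for the per-step gradient all-reduce
# of data parallelism.
```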
Network Architecture
Network topology design must take into account the parallelization schemes being used.
GPU deployments involve several networks: the front-end network, the back-end network, and the scale-up (NVLink) network, each supporting different parallelization schemes.
The NVLink network is the only option fast enough to meet the bandwidth demands of tensor parallelism. The back-end network can comfortably handle most other forms of parallelism, and where the convergence (oversubscription) ratio becomes a constraint, data parallelism, with its low communication volume, is the traffic best suited to those links.
When building a super-scale AI computing cluster with 100,000 H100 GPUs, there are three primary network options: Broadcom Tomahawk 5, NVIDIA InfiniBand, and NVIDIA Spectrum-X. In large-scale AI clusters, Spectrum-X offers significant advantages over InfiniBand in performance, power consumption, and cost. Spectrum-X is a high-performance Ethernet platform developed by NVIDIA; its switch silicon is used exclusively within the Spectrum-X platform and is not sold separately. Each of the three options has its own strengths and weaknesses, and the right choice depends on actual requirements.
InfiniBand
InfiniBand's advantage is SHARP, which performs reductions inside the network; Ethernet does not support this kind of in-network reduction.
The InfiniBand NDR Quantum-2 switch features 64 400G ports. In contrast, the Spectrum-X Ethernet SN5600 switch and Broadcom’s Tomahawk 5 switch ASIC both provide 128 400G ports, offering higher port density and performance.
"The Quantum-2 switch has limited ports, allowing for a maximum of 65,536 fully interconnected H100 GPUs in a 100,000-node cluster."
The next-generation InfiniBand switch, the Quantum-X800, will address capacity issues with 144 800G ports, but it is only compatible with NVL72 and NVL36 systems, making it unlikely to see widespread adoption in B200 or B100 clusters.
Spectrum-X
Spectrum-X benefits from first-class support for NVIDIA libraries such as NCCL; adopting this new product line also makes you one of its first customers.
Spectrum-X must be purchased with NVIDIA LinkX transceivers, as other transceivers may not function properly or have not been validated.
In the 400G Spectrum-X, NVIDIA has adopted Bluefield-3 as a temporary replacement for ConnectX-7, while ConnectX-8 is expected to work seamlessly with the 800G Spectrum-X.
In large data-center volumes, Bluefield-3 and ConnectX-7 each carry an ASP of roughly $300, but Bluefield-3 draws an extra 50 watts. With eight NICs per node, that adds about 400 watts per node and worsens the training servers' energy efficiency per petaflop.
Across a 100,000-GPU Spectrum-X deployment, this overhead adds up to roughly 5 MW (12,500 nodes × 400 W); a Broadcom Tomahawk 5 deployment with standard NICs does not incur it.
To avoid paying high fees to NVIDIA, many customers are opting to deploy switches based on the Broadcom Tomahawk 5, which provides a total switching bandwidth of 51.2 Tb/s. The chip can drive 800 Gbps of traffic at 5.5 W, reducing the need for pluggable optical modules to carry signals to the switch front panel.
Switches based on the Tomahawk 5 feature 128 400G ports, just like the Spectrum-X SN5600 switch, and can achieve equivalent performance if the buyer has skilled network engineers. Furthermore, generic transceivers and copper cables can be purchased from any vendor and used in a mixed deployment.
Many customers choose to partner with ODM manufacturers, such as Celestica for switches and Innolight and Eoptolink for transceivers.
"Considering the costs of switches and generic transceivers, Tomahawk 5 is significantly more cost-effective than NVIDIA InfiniBand. Moreover, it offers better value for money compared to NVIDIA Spectrum-X."
Unfortunately, patching and optimizing NCCL for Tomahawk 5-based clusters requires solid engineering skills. NCCL works out of the box, but it is optimized only for NVIDIA Spectrum-X and NVIDIA InfiniBand.
If you have $4 billion to spend on a 100,000-GPU cluster, you presumably also have the engineering capability to patch and optimize NCCL.
Software development is challenging, yet Semianalysis predicts that hyperscale data centers will shift toward other optimization solutions and abandon InfiniBand.
Rail Optimization
To improve network maintainability and extend the lifespan of copper cables (<3 meters) and multimode fiber (<50 meters), some customers are choosing to abandon NVIDIA’s recommended rail-optimized design in favor of a middle-of-rack design.
"Rail-optimized technology allows each H100 server to connect to eight independent leaf switches rather than converging within a single rack. This design enables each GPU to communicate with more distant GPUs with just a single hop, thereby significantly improving full-to-full inter-GPU communication performance."
For example, all-to-all collective communication is used extensively in Mixture-of-Experts (MoE) expert parallelism.
Within a single rack, switches can be connected with passive direct-attach copper cables (DACs) or active electrical cables (AECs). In a rail-optimized design, however, a server and its leaf switches sit in different racks, so optics are required for those links.
Additionally, the distance between leaf switches and spine switches may exceed 50 meters, necessitating the use of single-mode optical transceivers.
With a non-rail-optimized design, the 98,304 optical transceivers connecting GPUs to leaf switches can be replaced with low-cost direct-attach copper cables, raising the share of GPU links running over copper to 25–33%.
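The contrast between the two cabling designs can be sketched as a simple placement function. The rack and rail counts here are illustrative assumptions; the point is only that the middle-of-rack layout keeps every GPU-to-leaf hop short enough for DAC copper, while the rail-optimized layout pushes those hops onto optics.

```python
# Sketch: which leaf switch a GPU's NIC plugs into under the two designs.
GPUS_PER_SERVER = 8   # one rail per GPU index in the rail-optimized layout

def leaf_for(rack: int, gpu_index: int, rail_optimized: bool) -> str:
    if rail_optimized:
        # GPU i of every server in the group connects to rail switch i,
        # which usually sits in another rack -> needs optics (~50 m reach).
        return f"rail-leaf-{gpu_index}"
    # Middle-of-rack: the leaf switch lives in the same rack as the server,
    # so the link stays under a few meters and can be a DAC copper cable.
    return f"rack{rack}-leaf"

for g in range(3):
    print(g, leaf_for(rack=7, gpu_index=g, rail_optimized=True),
             leaf_for(rack=7, gpu_index=g, rail_optimized=False))
```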
DAC copper cables run cooler, draw less power, cost less, and are more reliable than optics. This design reduces intermittent link flaps and failures, which are among the main weaknesses of optical components in high-speed interconnects.
When using DAC copper cables, a Quantum-2 InfiniBand switch consumes 747 watts; with multimode optical transceivers, power consumption rises to 1,500 watts.
Initial cabling is extremely time-consuming for data-center technicians when each link spans up to 50 meters between endpoints in different racks; a middle-of-rack design makes this work far more efficient.
In the middle-of-rack design, the leaf switch shares a rack with all of the GPUs it serves. Because every link stays within the rack, the compute-node-to-leaf links can be tested at the integration factory before the rack is even deployed.
Network Configuration Example
As shown in the figure, this is a common three-layer Fat-Tree topology (SuperSpine-Spine-Leaf), where two Spine-Leaf units form a Pod.
Because the spine switches must also connect upward to the SuperSpine switches, only half of their ports face downward and the number of groups is halved. A pod contains 64 spine switches, corresponding to 8 groups, and likewise 64 leaf switches; with 64 downlink ports per leaf switch, each pod serves 64 × 64 = 4,096 GPUs.
With multiple Pods, 64 SuperSpine fabrics can then be constructed, each providing full interconnection among the Spine switches of the different Pods. Taking 8 Pods as an example, the i-th Spine switch of every Pod connects in a full mesh to the SuperSpine switches of fabric i. Since each of those 8 Spine switches has 64 uplink ports, each fabric needs only 4 SuperSpine switches with 128 ports (8 × 64 / 128 = 4).
The above configuration with 8 Pods corresponds to:
Total GPUs: 4096 × 8 = 32,768
SuperSpine Switches: 64 × 4 = 256
Spine Switches: 64 × 8 = 512
Leaf Switches: 64 × 8 = 512
Total Switches: 256 + 512 + 512 = 1,280
Total number of optical modules: 1,280 × 128 + 32,768 = 196,608 (one per switch port plus one per GPU NIC)
In theory, a maximum of 128 pods can be supported, corresponding to the following number of devices:
GPUs: 4096 × 128 = 524,288 = 2 × (128/2)³
SuperSpine Switches: 64 × 64 = 4,096 = (128/2)²
Spine Switches: 64 × 128 = 8,192 = 2 × (128/2)²
Leaf Switches: 64 × 128 = 8,192 = 2 × (128/2)²
Total switches: 4,096 + 8,192 + 8,192 = 20,480, i.e. 5 × (128/2)². A cluster on the order of 10,000 cards corresponds to roughly three such pods, and the design scales up by adding further pods.
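The pod arithmetic above can be bundled into a small helper. It simply restates the text's formulas for 128-port switches (64 leaf and 64 spine switches per pod, 64 SuperSpine fabrics); the function name and packaging are my own.

```python
# Fat-tree sizing for 128-port switches, following the formulas above.
def fat_tree(ports: int = 128, pods: int = 8):
    half = ports // 2
    leaf = spine = half * pods                    # 64 leaf + 64 spine switches per pod
    gpus = half * half * pods                     # 64 downlinks per leaf -> 4,096 GPUs/pod
    superspine = half * (pods * half // ports)    # 64 fabrics x (spine uplinks / 128)
    optics = (leaf + spine + superspine) * ports + gpus   # switch ports + GPU NICs
    return {"gpus": gpus, "leaf": leaf, "spine": spine,
            "superspine": superspine, "optics": optics}

print(fat_tree(pods=8))     # 32,768 GPUs, 512+512+256 switches, 196,608 optics
print(fat_tree(pods=128))   # 524,288 GPUs, 8,192+8,192+4,096 = 20,480 switches
```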
(3) Reliability and Recovery
Synchronous model training raises reliability concerns for massive clusters. Common issues include GPU HBM ECC errors, GPU driver freezes, fiber optic transceiver failures, and network card overheating.
To minimize downtime, data centers must configure both hot and cold standby equipment. When issues arise, the optimal strategy is to continue training using standby nodes rather than halting operations immediately.
Data center technicians can repair damaged GPU servers within hours, but in some cases, it may take several days for a node to be brought back online.
During model training, checkpoints must be saved periodically to CPU memory or SSD-based persistent storage so that the run can recover from errors such as HBM ECC errors. When an error occurs, the model and optimizer weights are reloaded from the latest checkpoint and training resumes.
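A minimal checkpoint/restore sketch in PyTorch, assuming a standard torch.nn model and optimizer; the path, interval, and storage target are placeholders, and real systems would also checkpoint data-loader state and use asynchronous or in-memory copies.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist model + optimizer state so training can resume after a failure.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]    # resume training from the step after this one

# Inside the training loop (illustrative):
#   if step % checkpoint_interval == 0:
#       save_checkpoint(model, optimizer, step)
```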
Fault-tolerant training techniques can be used to provide user-level, application-driven methods for handling GPU and network failures.
Unfortunately, frequent checkpointing and fault-tolerant training techniques degrade the system's overall MFU (model FLOPS utilization), because the cluster must repeatedly pause to save weights to persistent storage or CPU memory.
Even saving checkpoints only once every 100 iterations carries significant cost. Taking a 100,000-card cluster as an example, if each iteration takes 2 seconds, a failure on the 99th iteration after a checkpoint throws away up to 99 × 2 s × 100,000 GPUs ≈ 229 GPU-days of work.
Another fault-recovery strategy is to have standby nodes copy the weights from other GPUs over the backend fabric via RDMA. This is highly efficient: backend NICs run at up to 400 Gbps and each GPU holds 80 GB of HBM, so the copy takes only about 1.6 seconds.
With this strategy, at most one step is lost (since most GPUs' HBM still holds the latest weight update), costing about 2.3 GPU-days, plus roughly 1.85 GPU-days to copy the weights over RDMA from other GPUs' HBM.
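The two recovery strategies can be compared with the same arithmetic the text uses; all of the inputs (2-second steps, 100,000 GPUs, 80 GB of HBM, 400 Gbps NICs) come from the text.

```python
# Recovery cost comparison, reproducing the figures above.
gpus = 100_000
step_s = 2
gpu_days = lambda seconds: seconds * gpus / 86_400

# Checkpoint every 100 iterations: a failure just before the next checkpoint
# throws away up to 99 steps of work across the whole cluster.
print(f"replay loss: {gpu_days(99 * step_s):.0f} GPU-days")        # ~229

# RDMA rebuild from peers' HBM: lose at most one step, plus the copy time.
copy_s = 80e9 / (400e9 / 8)                                        # 80 GB over 400 Gb/s = 1.6 s
print(f"one lost step: {gpu_days(step_s):.1f} GPU-days")           # ~2.3
print(f"weight copy:   {gpu_days(copy_s):.2f} GPU-days")           # ~1.85
```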
Many leading AI labs have adopted this technology, yet numerous smaller companies still cling to cumbersome, slow, and inefficient methods—restarting the process to recover from failures. Achieving fault recovery through memory reconstruction can significantly improve the MFU efficiency of large-scale training runs, saving several percentage points in time.

In the realm of network failures, InfiniBand/RoCE link failures are the most common issue. Because of the sheer number of transceivers, even if each link from a NIC to a bottom-tier switch has a mean time between failures of 5 years, a brand-new, fully functional cluster would see its first job failure after only about 26.28 minutes.
In a 100,000-GPU cluster, the time spent restarting runs after fiber failures would then far exceed the time spent on model computation, so fault-recovery strategies that do not use memory reconstruction severely hurt efficiency.
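The 26.28-minute figure follows directly from dividing the per-link MTBF by the number of links; assuming one NIC-to-leaf link per GPU (the text does not state the exact link count) gives exactly that number.

```python
# Expected time to the first link failure: per-link MTBF divided by link count.
links = 100_000                         # assumption: one NIC-to-leaf link per GPU
mtbf_minutes = 5 * 365 * 24 * 60        # 5-year mean time between failures per link
print(f"{mtbf_minutes / links:.2f} minutes")   # 26.28
```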

Since GPUs are directly connected to ConnectX-7 NICs, the network architecture lacks fault-tolerance design, forcing users to address failures within their training code, thereby increasing codebase complexity.
Large Language Models (LLMs) utilize tensor parallelism within nodes; if a single network card, transceiver, or GPU fails, the entire server crashes. Since this strategy involves significant network traffic, it requires high-speed communication bandwidth between different computing devices within the server.
Currently, significant efforts are underway to make the network reconfigurable and reduce node vulnerability. This work is critical because the current state means that the entire GB200 NVL72 can go down due to a single GPU or optical failure.
The RAS engine accurately predicts potential failures by deeply analyzing key chip-level data such as temperature, ECC retry counts, clock speed, and voltage, and promptly notifies data center engineers to ensure stable system operation.
"This enables the technical team to perform proactive maintenance, such as increasing fan speeds to maintain stability, and removing servers from the active queue during maintenance windows for in-depth inspections."
Before a training task begins, the RAS engine in each chip performs a comprehensive self-check, such as executing matrix multiplication with known results to detect silent data corruption (SDC).
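The known-answer idea behind that self-check can be illustrated with a toy version: run a computation whose result is known (here a matrix product checked against a higher-precision reference) and flag any mismatch as possible silent data corruption. This is only a conceptual sketch; the real RAS engine operates at the chip level and is not exposed as a Python API.

```python
import numpy as np

def sdc_self_check(n: int = 512, seed: int = 0) -> bool:
    # Toy known-answer test: compare the device-under-test result (float32)
    # against a trusted higher-precision reference computed separately.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n)).astype(np.float32)
    b = rng.standard_normal((n, n)).astype(np.float32)
    reference = a.astype(np.float64) @ b.astype(np.float64)   # "golden" result
    result = a @ b                                            # would run on the hardware under test
    return bool(np.allclose(result, reference, rtol=1e-3, atol=1e-3))

print("self-check passed:", sdc_self_check())
```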
(4) Bill of Materials
Specifically, the options fall into four categories (the original text gives a 7:1 convergence ratio, though arguably it should be 8:1):
"A powerful 4-layer InfiniBand network with 32,768 GPU clusters and track-optimized technology, delivering a 7:1 improvement in convergence speed."
The Spectrum X network is an Ethernet platform developed by NVIDIA. It is an Ethernet platform specifically designed to improve the performance and efficiency of Ethernet-based AI clouds. This network platform features a 3-tier architecture, comprising 32,768 GPU clusters, a track-optimized design, and a 7:1 convergence ratio.
3. A 3-layer InfiniBand network comprising 24,576 GPU clusters, featuring a non-track-optimized design for inter-cluster connectivity in the front-end network.
"Equipped with a 3-layer Broadcom Tomahawk 5 Ethernet network, featuring 32,768 GPU clusters, track-optimized design, and a 7:1 convergence ratio."

By comparison, Option 1 is 1.3 to 1.6 times more expensive than the others; Option 2 offers a larger cluster and higher bandwidth at comparable cost, but consumes more power; Option 3 significantly reduces flexibility in the choice of parallelization strategies.
The 32k cluster based on Broadcom Tomahawk 5, paired with a 7:1 convergence ratio, is the most cost-effective option. This is why many companies have chosen to build similar networks.
(5) Floor Plan
Finally, when designing the cluster, the rack layout must also be optimized.
This is because multimode transceivers have a reach of only about 50 meters; if racks are placed at the end of a row, the spine switches in the middle of the row can end up out of range.
The floor plan for a 32k Spectrum-X/Tomahawk 5 cluster, featuring a rail-optimized design, is estimated to require at least 80×60 meters of floor space.

