Key to AGI: The 100,000-H100 Super AI Computing Cluster

Published August 26, 2024


Since the release of GPT-4, there have been signs that the global momentum in AI development is slowing down.


However, this does not mean that the Scaling Law has failed, nor is it due to a lack of training data; rather, it is a clear case of hitting a computational power bottleneck.


Specifically, GPT-4 required approximately 2e25 FLOPs for training. Several recently released large models—such as Google’s Gemini Ultra, NVIDIA’s Nemotron 340B, and Meta’s Llama 3 405B—used roughly the same amount of training compute as GPT-4, showed no qualitative improvement, and thus failed to unlock new capabilities in the models.



To become the creators of the AI era, tech giants such as OpenAI/Microsoft, xAI, and Meta are all racing to build super AI computing clusters consisting of 100,000 H100 units.


To achieve this goal, money alone is far from sufficient; it involves numerous technical challenges, including energy constraints, network topology, reliability assurance, parallelization schemes, and rack layout.


These technical challenges are obstacles on humanity’s path to AGI, yet they also present enormous investment opportunities.


Recently, SemiAnalysis released a major in-depth report titled "100,000 H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing," which provides a comprehensive analysis of this topic and is highly valuable.


Below, I will provide an overview of this comprehensive report.


Before diving in, let me list the key conclusions:


  • The number of GPUs determines the survival of AI companies. Currently, the threshold for the top tier of AI companies is a 32,000-card cluster; by next year, this threshold may rise to a 100,000-card (H100) cluster.


  • An AI cluster consisting of 100,000 H100 cards consumes approximately 150 MW of power, requires over $4 billion in capital expenditure, and incurs annual electricity costs of up to $120 million.


  • To support the training of next-generation multimodal large models with trillions of parameters, sophisticated network topology design is required, incorporating technologies such as data parallelism, tensor parallelism, and pipelined parallelism for distributed training.


  • To avoid paying the massive "Nvidia Tax," an increasing number of hyperscalers are choosing Broadcom’s Tomahawk 5 to build their super AI clusters instead of Nvidia’s Spectrum-X. Broadcom’s networking revenue is expected to continue soaring in the future.


Let’s get started.


I. The Current State of AI Infrastructure: Laying the Tracks While Racing Ahead


AI infrastructure has become a major bottleneck for the emergence of next-generation large models.


Some have described the situation at OpenAI as akin to a train pioneering a new frontier: scientists are responsible for driving the train at high speeds, while infrastructure engineers are tasked with laying the tracks ahead—the two processes proceed in tandem.


According to estimates, the capital expenditure for a single super AI cluster exceeds $4 billion, with power consumption as high as 150 MW and annual energy consumption of 1.59 TWh. Calculated at a standard rate of $0.078 per kWh, the annual electricity bill would amount to $124 million.
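The electricity figure can be sanity-checked with a quick calculation (a rough sketch; the PUE overhead factor is my assumption, chosen to reconcile the 150 MW IT load with the 1.59 TWh annual figure):

```python
# Rough sanity check on the cluster's power and electricity figures.
# Assumption: the 1.59 TWh annual figure implies a PUE (power usage
# effectiveness) overhead on top of the 150 MW of IT load.

it_load_mw = 150                      # critical IT power of the cluster
hours_per_year = 24 * 365             # 8,760 hours
pue = 1.21                            # assumed facility overhead factor

annual_twh = it_load_mw * 1e6 * hours_per_year * pue / 1e12   # Wh -> TWh
price_per_kwh = 0.078                 # $/kWh, standard rate from the text
annual_bill = annual_twh * 1e9 * price_per_kwh                # kWh * $/kWh

print(f"{annual_twh:.2f} TWh/year, ${annual_bill / 1e6:.0f}M/year")
# -> 1.59 TWh/year, $124M/year
```

With the assumed PUE of about 1.2, the 150 MW load, 1.59 TWh of annual consumption, and the $124 million bill are mutually consistent.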


Despite the exorbitant costs, global tech giants are flocking to build such clusters.


To illustrate the computational power provided by a super AI cluster comprising 100,000 GPUs, let’s run some calculations.


OpenAI trained GPT-4 on approximately 20,000 A100 GPUs over 90 days, consuming about 2.15e25 BF16 FLOPs of training compute.

The cluster’s peak throughput was about 6.28 BF16 ExaFLOP/s.


On a supercluster comprising 100,000 H100 chips, peak throughput would surge to roughly 198 FP16 ExaFLOP/s, a 31.5-fold increase.



Training a trillion-parameter model using H100s can achieve up to 35% FP8 MFU and 40% FP16 MFU.


MFU (Model FLOPs Utilization) measures the ratio of the FLOPs a model actually achieves during training to the hardware’s theoretical peak. It reflects how effectively the hardware is utilized when training large models.


Training on a cluster of 100,000 H100 GPUs for 100 days can achieve approximately 6e26 effective FP8 FLOPs.


In other words, training GPT-4 can be completed in just 4 days.
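The arithmetic behind these comparisons can be reproduced in a few lines (every number comes from the figures above; only the variable names are mine):

```python
# Back-of-the-envelope for the training-time comparison above.
total_gpt4_flops = 2.15e25            # BF16 FLOPs used to train GPT-4

cluster_peak_a100 = 6.28e18           # 20,000 A100s: 6.28 ExaFLOP/s peak
cluster_peak_h100 = 198e18            # 100,000 H100s: ~198 ExaFLOP/s FP16
print(f"speedup: {cluster_peak_h100 / cluster_peak_a100:.1f}x")
# -> speedup: 31.5x

# Effective throughput: ~6e26 FP8 FLOPs over 100 days of training.
effective_flops_per_day = 6e26 / 100
days_for_gpt4 = total_gpt4_flops / effective_flops_per_day
print(f"GPT-4 retrain: ~{days_for_gpt4:.1f} days")
# -> GPT-4 retrain: ~3.6 days
```

The ~3.6-day result is what the text rounds to "just 4 days."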


II. The Energy Challenges Behind AI


Behind the computing power bottleneck lie two major hurdles: "energy" and "engineering capabilities."


A cluster consisting of 100,000 H100 units requires approximately 150 MW of power. By comparison, El Capitan, the world’s largest national supercomputer to date, requires only 30 MW—just one-fifth of that amount.


This 150 MW can be broken down into power consumption within the H100 Server itself and power consumption from supporting equipment outside the H100 Server.


Within the H100 server, each GPU consumes approximately 700W, while each GPU’s share of the accompanying CPU, NIC (Network Interface Card), and PSU (Power Supply Unit) accounts for about 575W.


Externally, the AI cluster also includes many other devices such as storage servers, network switches, and optical transceivers, which account for about 10% of the total power consumption.
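A quick reconstruction of the ~150 MW figure from these per-GPU numbers (the assumption that the remaining gap is cooling and power-distribution loss is mine):

```python
# Reconstructing the cluster power figure from the per-GPU numbers above.
num_gpus = 100_000
gpu_w = 700                 # per H100 GPU
support_w = 575             # per-GPU share of CPU, NIC, and PSU

server_mw = num_gpus * (gpu_w + support_w) / 1e6   # in-server load, MW
# Storage servers, switches, and optical transceivers add ~10% on top.
total_mw = server_mw / 0.90
print(f"servers: {server_mw:.1f} MW, total: {total_mw:.1f} MW")
# -> servers: 127.5 MW, total: 141.7 MW
```

The result lands close to the ~150 MW headline figure; the remainder is plausibly cooling and distribution overhead, which the breakdown above does not itemize.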


Currently, no data center in the world has the capacity to host a 150 MW AI cluster. xAI has even converted an old factory in Memphis, Tennessee, into a data center.


These AI clusters are interconnected via optical links, and the cost of optical communication rises with transmission distance.


The maximum transmission distance for multimode SR and AOC transceivers is approximately 50 meters.


Long-distance single-mode DR and FR transceivers have transmission distances ranging from 500 meters to 2,000 meters, but their cost is 2.5 times that of the former.


Campus-level 800G coherent optical transceivers can transmit over 2,000 meters, but their cost is more than 10 times higher.



For smaller-scale H100 clusters, the standard approach is to interconnect all GPUs using 400G multimode optical transceivers via Layer 1–2 switches.


For large-scale H100 clusters, additional layers of switches are required, and the cost of optical equipment becomes very high. Different network topologies result in vastly different capital expenditures.


Each data center building can be considered a "compute island," containing multiple "compute pods" interconnected via low-cost copper cables. Multiple "compute islands" are then interconnected via long-distance optical communication.



Currently, it is quite difficult to centrally provide 150 MW of power within a single data center, making the design of the network topology particularly critical.


Some AI companies opt for Broadcom’s Tomahawk 5, others choose InfiniBand, and still others select NVIDIA’s Spectrum-X. Below, we will explore the reasons behind these choices and compare the strengths and weaknesses of these solutions.


III. The Core of AI Infrastructure: Network Topology and Parallel Design


To gain a deep understanding of network topology, one must first grasp three different types of parallel design methods: data parallelism, tensor parallelism, and pipelined parallelism.


1. Data Parallelism


Data parallelism is the simplest form of parallelism, where each GPU holds a complete copy of the model weights and processes a distinct subset of training data.

This parallelism method has the lowest communication requirements, as only gradient data needs to be exchanged between GPUs.


However, data parallelism requires each GPU to have sufficient memory to store the entire model’s weights. For a model like GPT-4, which has 1.8 trillion parameters, this translates to a memory footprint of up to 10.8 TB.
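One common accounting that reproduces the 10.8 TB figure (the 6-bytes-per-parameter breakdown is my assumption, not stated in the report):

```python
# Where a 10.8 TB footprint for a 1.8T-parameter model can come from.
params = 1.8e12

# One plausible mixed-precision accounting, per parameter:
#   2 bytes FP16 weights + 2 bytes FP16 gradients + 2 bytes of
#   optimizer state -- exact breakdowns vary by optimizer and recipe.
bytes_per_param = 6
total_tb = params * bytes_per_param / 1e12
print(f"{total_tb:.1f} TB")
# -> 10.8 TB
```

Whatever the exact split, the point stands: no single GPU holds anywhere near this much memory, which motivates the parallelism schemes below.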



2. Tensor Parallelism


To overcome the memory limitations of data parallelism, tensor parallelism was developed.


Tensor parallelism distributes the work and weights of each model layer across multiple GPUs, typically partitioning along the hidden dimensions. This means that each GPU processes only a portion of the model, rather than the entire model.


In tensor parallelism, GPUs require frequent communication to exchange intermediate computation results, appearing from the outside as a single massive GPU. Therefore, tensor parallelism requires high-bandwidth, low-latency network connections.


Tensor parallelism effectively reduces the memory requirement per GPU. For example, with 8-way tensor parallelism connected via NVLink, the memory used per GPU is reduced by a factor of 8.
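A minimal sketch of the idea in pure Python: split a layer’s weight matrix along its output (hidden) dimension across two hypothetical "GPUs" and reassemble the partial results, mimicking the all-gather that tensor parallelism performs:

```python
# Toy tensor parallelism: column-parallel split of one linear layer.

def matmul(x, w):
    """x: list of rows, w: list of rows; returns x @ w."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in x]

x = [[1.0, 2.0]]                      # one input activation, hidden dim 2
w = [[1.0, 2.0, 3.0, 4.0],            # full 2x4 weight matrix
     [5.0, 6.0, 7.0, 8.0]]

# "GPU 0" holds the first two output columns, "GPU 1" the last two;
# each stores and computes only its shard (half the weight memory).
w0 = [row[:2] for row in w]
w1 = [row[2:] for row in w]
y0, y1 = matmul(x, w0), matmul(x, w1)

# An all-gather across the GPUs reassembles the full activation.
y = [r0 + r1 for r0, r1 in zip(y0, y1)]
assert y == matmul(x, w)              # identical to the unsharded layer
print(y)
# -> [[11.0, 14.0, 17.0, 20.0]]
```

The frequent exchange of such intermediate results is exactly why tensor parallelism demands the high-bandwidth, low-latency links described above.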



3. Pipeline Parallelism


Another method for overcoming GPU memory limitations is pipeline parallelism.


The core idea of pipeline parallelism is to distribute different layers of the model across different GPUs, with each GPU responsible for computing only a portion of the layers.


Once a GPU completes the forward and backward propagation operations for a layer, it can pass the intermediate results to the next GPU and immediately begin processing the next batch of data.


Using pipeline parallelism reduces the memory capacity required per GPU, as each GPU stores only a portion of the model’s layers.


However, it increases communication traffic between GPUs; after each GPU completes its computations, it must pass the data to the next GPU, which requires an efficient network connection to support rapid data transfer.


Pipeline parallelism places high demands on communication, but not as high as tensor parallelism.
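The overlap pipeline parallelism achieves can be sketched with a toy GPipe-style schedule (3 stages, 4 microbatches, forward passes only; all sizes are illustrative):

```python
# Toy pipeline schedule: which microbatch each stage (GPU) processes
# at each time step. Microbatch mb reaches stage s at step mb + s.
stages, microbatches = 3, 4
timeline = {}   # (stage, step) -> microbatch id

for mb in range(microbatches):
    for stage in range(stages):
        timeline[(stage, mb + stage)] = mb

for stage in range(stages):
    row = [timeline.get((stage, t), "-")
           for t in range(stages + microbatches - 1)]
    print(f"GPU {stage}: {row}")
# GPU 0 works on microbatches 0-3 immediately; GPUs 1 and 2 start one
# and two steps later (the "pipeline bubble"), then all three overlap.
```

Once the pipeline fills, every GPU is busy on a different microbatch simultaneously, which is how memory is saved without leaving most of the cluster idle.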



4. 3D Parallelism


To maximize model FLOP utilization (MFU), hyperscalers typically combine these three parallelism techniques to form 3D parallelism.


The specific approach is as follows: first, tensor parallelism is used between GPUs within an H100 server; then, pipeline parallelism is used between nodes within the same compute island; finally, data parallelism is used between different compute islands.
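A sketch of how a global GPU rank could map onto this 3D layout (the 8-way and 16-way degrees are illustrative assumptions, not figures from the report):

```python
# Hypothetical 3D-parallel rank mapping: tensor parallelism varies
# fastest (inside a server), then pipeline (across nodes in an island),
# then data parallelism (across islands).
TP, PP = 8, 16               # assumed 8-way tensor, 16-way pipeline
DP = 100_000 // (TP * PP)    # remaining degree goes to data parallelism

def coords(rank):
    tp = rank % TP                   # position within the server's NVLink group
    pp = (rank // TP) % PP           # pipeline stage within the island
    dp = rank // (TP * PP)           # which compute island (data-parallel replica)
    return tp, pp, dp

print(coords(0), coords(7), coords(8), coords(128))
# -> (0, 0, 0) (7, 0, 0) (0, 1, 0) (0, 0, 1)
```

The ordering matches the bandwidth hierarchy: the chattiest dimension (tensor) stays on NVLink, the next (pipeline) stays within an island, and the least demanding (data) crosses the long-haul links between islands.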



IV. AI Cluster Design Approaches of Hyperscalers


Now that we understand the parallelism design approach, let’s examine the specific designs of the hyperscalers’ super AI computing clusters.


First, let’s examine Meta’s design. As shown in the figure below, this is a computing cluster comprising 32,000 GPUs, organized into 8 compute islands.



GPUs within each compute island are connected via high-bandwidth links, while islands are interconnected via a top-layer switch.


The bandwidth of the top-layer switches is intentionally designed to be lower than the total bandwidth connecting to the lower-layer switches; this design is known as “oversubscription.”


Bandwidth oversubscription can slow down communication between islands, but in practical applications, it typically does not significantly impact performance because not all servers use the maximum bandwidth for communication at the same time.


Implementing bandwidth oversubscription on the top-layer switches balances the trade-off between performance and cost. Although this design may limit the communication bandwidth between islands, effective network management ensures the operational efficiency of the entire cluster while reducing construction and maintenance costs.
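To put a rough number on oversubscription, here is a sketch using a 7:1 ratio (a common figure for such designs) and an assumed 400 Gb/s of injection bandwidth per GPU:

```python
# Illustrative oversubscription math; the per-GPU bandwidth and the
# 7:1 ratio are assumptions for this sketch, not figures from Meta.
gpus_per_island = 32_768
gbps_per_gpu = 400           # assumed injection bandwidth per GPU
oversub = 7                  # 7:1 oversubscription at the top layer

demand_tbps = gpus_per_island * gbps_per_gpu / 1e3   # worst case, Tb/s
uplink_tbps = demand_tbps / oversub                  # actually provisioned
print(f"worst-case demand: {demand_tbps:.0f} Tb/s, "
      f"provisioned uplinks: {uplink_tbps:.0f} Tb/s")
```

The top layer provisions only a seventh of the theoretical worst case, which is acceptable precisely because 3D parallelism routes the heaviest traffic below the top layer.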


In contrast to Meta, Google has designed a network architecture specifically to support large-scale TPU computing clusters, known as ICI (Inter-Chip Interconnect).


The ICI network scales to at most 8,960 TPU chips, with each 64-TPU water-cooled rack connected by expensive 800G links; training traffic beyond the ICI domain must run over the front-end network.


Since the ICI network can only scale to a certain size—unlike GPU clusters, which can scale by adding more network tiers—Google must compensate for this by continuously enhancing the TPU front-end network.



V. The Reliability Nightmare of AI Infrastructure


Reliability is a major challenge facing AI clusters.


During the training of large models, GPU nodes frequently crash or encounter errors. Common errors include GPU HBM ECC errors, GPU driver freezes, optical transceiver failures, and network card overheating.


To ensure the continuity of model training and reduce the mean time to recovery, data centers must maintain hot standby nodes.


When a failure occurs, training must never be stopped; instead, training should continue immediately by switching to a working standby node.


In most cases, simply rebooting the node resolves the issue. However, in some instances, technical personnel must intervene to perform physical diagnostics and replace the hardware.


Sometimes technicians can repair a damaged GPU in just a few hours, but more often, it takes several days for a damaged node to be brought back online for training.



During model training, we need to frequently save model checkpoints to CPU memory or NAND SSDs to guard against errors such as HBM ECC.


When an error occurs, the model weights must be reloaded from the slower memory tier, and training must be restarted.


However, frequent checkpointing can degrade the system’s overall MFU. The cluster must constantly pause to back up the current weights.


Typically, checkpointing occurs once every 100 iterations, meaning you could potentially lose up to 99 steps of useful training.


On a 100,000-card cluster, if each iteration takes 2 seconds, a failure occurring at the 99th iteration would result in the loss of 229 GPU-days of work.
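The 229 GPU-day figure checks out:

```python
# Verifying the GPU-days lost when a failure strikes just before a
# checkpoint, using the numbers from the text.
gpus = 100_000
seconds_per_iter = 2
lost_iters = 99            # checkpoint every 100 iterations, fail at 99

lost_gpu_seconds = gpus * lost_iters * seconds_per_iter
lost_gpu_days = lost_gpu_seconds / 86_400   # 86,400 seconds per day
print(f"{lost_gpu_days:.0f} GPU-days lost")
# -> 229 GPU-days lost
```

This is why checkpoint frequency is a genuine trade-off: checkpointing more often caps the loss per failure but pauses the whole cluster more frequently, eating into MFU.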


VI. Battle of the Titans: Broadcom Tomahawk 5 vs. NVIDIA Spectrum-X


When building a super AI computing cluster with 100,000 H100 GPUs, there are three primary networking solutions to choose from: Broadcom Tomahawk 5, NVIDIA InfiniBand, and NVIDIA Spectrum-X. Below, we will compare the pros and cons of these three solutions in detail.


In large-scale AI clusters, Spectrum-X offers significant advantages over InfiniBand, including performance, reliability, and cost benefits.


Each Spectrum-X Ethernet SN5600 switch features 128 400G ports, whereas the InfiniBand NDR Quantum-2 switch offers only 64 400G ports.


It is worth noting that Broadcom’s Tomahawk 5 switch ASIC also supports 128 400G ports, placing current InfiniBand solutions at a significant disadvantage.
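Switch radix compounds across network tiers: a standard non-blocking fat-tree built from k-port switches supports k³/4 hosts at three tiers. A quick sketch of what the 128-port versus 64-port gap implies:

```python
# Standard fat-tree capacity: a non-oversubscribed design built from
# k-port switches supports k * (k/2)^(tiers-1) hosts.
def fat_tree_hosts(ports, tiers=3):
    # Each additional tier multiplies capacity by ports // 2.
    return ports * (ports // 2) ** (tiers - 1)

print(fat_tree_hosts(128))   # 128-port class (Tomahawk 5, SN5600)
# -> 524288
print(fat_tree_hosts(64))    # 64-port class (InfiniBand NDR Quantum-2)
# -> 65536
```

At three tiers, the 128-port switches can serve eight times as many endpoints, which is why the lower radix puts current InfiniBand hardware at a structural disadvantage for 100,000-GPU clusters.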


Compared to Tomahawk, Spectrum-X’s main advantage lies in its first-class support from NVIDIA libraries such as NCCL, whereas using Tomahawk 5 requires extensive internal engineering to achieve maximum throughput.



To avoid paying the hefty "Nvidia Tax," an increasing number of hyperscalers are opting to deploy the Broadcom Tomahawk 5 solution.


Each Tomahawk 5-based switch has the same number of ports as the Spectrum-X SN5600 switch—128 400G ports—and offers comparable performance.


Most customers work directly with ODMs, such as Celestica for switches and Innolight or Eoptolink for transceivers. Consequently, the cost of Tomahawk 5 is significantly lower than that of NVIDIA InfiniBand and also cheaper than NVIDIA Spectrum-X.


However, to achieve performance comparable to Nvidia Spectrum-X with Tomahawk 5, you need sufficient engineering capabilities to optimize the NCCL communication cluster for Tomahawk 5.


Nvidia provides out-of-the-box NCCL communication libraries for Spectrum-X and InfiniBand, but these are not compatible with Broadcom’s Tomahawk 5.


Jensen has consistently referred to Nvidia as a software company, noting that its software ecosystem provides a deep moat. However, an increasing number of AI companies are now attempting to build their own engineering capabilities to avoid paying the hefty "Nvidia Tax."



VII. BOM Cost Estimation: How Much Capex Is Required for a 100,000-Card AI Cluster


Following the qualitative analysis, let’s attempt a quantitative assessment.


The following details the BOM costs for four design options for an AI cluster consisting of 100,000 H100 cards.


These four configurations are as follows:


  • Solution 1: 4-layer InfiniBand network, 32,768-GPU islands, rail-optimized, 7:1 oversubscription

  • Solution 2: 3-layer Spectrum-X network, 32,768-GPU islands, rail-optimized, 7:1 oversubscription

  • Solution 3: 3-layer InfiniBand network, 24,576-GPU islands, non-rail-optimized, in-node front-end networking

  • Solution 4: 3-layer Broadcom Tomahawk 5 Ethernet network, 32,768-GPU islands, rail-optimized, 7:1 oversubscription



It is evident that the Capex for a 100,000-GPU H100 super AI computing cluster is approximately $4 billion, varying slightly with the network type selected.


Comparing these four options, the cost of a 4-layer InfiniBand network is 1.3 to 1.6 times that of the other options, which is why no one is willing to choose a large-scale InfiniBand network.


Compared to InfiniBand, Spectrum-X offers larger compute islands and higher inter-island bandwidth, but it also comes at a significant cost: higher power requirements.

