Compute cluster network interconnect: RoCE or InfiniBand?

Published December 5, 2024


Data communication for AI servers involves three components: internal server communication, communication between servers within an AI cluster, and wide-area communication across clusters.

High-speed communication between GPUs within a server primarily uses NVLink. NVIDIA also uses NVLink to build SuperPOD clusters, but the number of GPUs it can connect is relatively limited, so between server nodes it is suited mainly to small-scale communication. Large-scale AI clusters primarily rely on RDMA networks, specifically RoCE or InfiniBand.

This article uses a typical NVIDIA A100 server as an example to detail the interconnect architecture between its various components. The internal network configuration of the A100 server is shown in the figure below:

[Figure: internal network configuration of the A100 server]

The main modules of the A100 server include: 2 CPUs, 2 InfiniBand storage network interface cards (BF3 DPUs), 4 PCIe Gen4 switch chips, 6 NVSwitch chips, 8 GPUs (A100), and 8 InfiniBand network interface cards. The 8 GPUs are connected in a full-mesh configuration via the 6 NVSwitch chips.

1. Between GPUs within the host, NVLink is used. The A100’s bidirectional bandwidth is 12 × 50 GB/s = 600 GB/s; the A800 is a cut-down version, with bidirectional bandwidth reduced to 8 × 50 GB/s = 400 GB/s.
2. Between GPUs and NICs within the host: GPU <--> PCIe Switch <--> NIC, with a theoretical unidirectional bandwidth of 32 GB/s.
3. Between GPUs across hosts: data is transmitted via InfiniBand NICs, as shown in the figure below:

[Figure: cross-host GPU communication via InfiniBand NICs]
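The intra-host bandwidth figures above are simple products, and they can be checked with a short sketch. The per-link NVLink value and the link counts come from the text. The PCIe calculation assumes a Gen4 x16 link at 16 GT/s per lane and ignores encoding and protocol overhead, which is why it lands exactly on the quoted 32 GB/s. The function names are illustrative only.

```python
# Figures quoted in the article; all values are in GB/s.
NVLINK_GB_PER_LINK = 50  # bidirectional bandwidth per NVLink link


def nvlink_bandwidth(num_links: int) -> int:
    """Aggregate bidirectional NVLink bandwidth per GPU, in GB/s."""
    return num_links * NVLINK_GB_PER_LINK


def pcie_raw_bandwidth(gt_per_s: float = 16.0, lanes: int = 16) -> float:
    """Raw unidirectional PCIe bandwidth in GB/s.

    Assumption: a Gen4 x16 link (16 GT/s per lane, 16 lanes),
    with encoding and protocol overhead ignored.
    """
    return gt_per_s * lanes / 8  # 8 bits per byte


a100 = nvlink_bandwidth(12)   # A100: 12 links
a800 = nvlink_bandwidth(8)    # A800: 8 links
pcie = pcie_raw_bandwidth()   # GPU <--> PCIe switch <--> NIC path

print(f"A100 NVLink: {a100} GB/s bidirectional")
print(f"A800 NVLink: {a800} GB/s bidirectional")
print(f"PCIe Gen4 x16: {pcie:.0f} GB/s unidirectional")

# Compare like with like: halve the bidirectional NVLink figure
# before dividing by the unidirectional PCIe figure.
print(f"A100 NVLink vs PCIe, per direction: {a100 / 2 / pcie:.1f}x")
```

The last line makes the gap between the two paths explicit: traffic that leaves the NVLink domain and goes through the PCIe switch to a NIC has roughly an order of magnitude less bandwidth available per direction.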

Whether it is the compute network or the storage network, RDMA is required to meet the high-performance demands of AI. The network adopts a Spine-Leaf architecture: 8 GPUs are directly connected to Leaf switches via InfiniBand NICs (HDR, 200 Gbps), and the Leaf switches are connected to Spine switches via a full-mesh topology, forming a cross-host GPU compute network.
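As a rough illustration of the Spine-Leaf wiring described above, the sketch below enumerates the leaf-to-spine links of a full mesh and computes a leaf's oversubscription ratio. The switch counts and port counts are hypothetical values chosen for the example, not figures from the article; only the 200 Gbps HDR link speed is taken from the text.

```python
from itertools import product


def leaf_spine_links(num_leaves: int, num_spines: int) -> list[tuple[str, str]]:
    """Full mesh between tiers: every leaf connects to every spine."""
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    spines = [f"spine{j}" for j in range(num_spines)]
    return list(product(leaves, spines))


def oversubscription(downlinks: int, uplinks: int,
                     down_gbps: int = 200, up_gbps: int = 200) -> float:
    """Ratio of a leaf's NIC-facing bandwidth to its spine-facing bandwidth.

    1.0 means non-blocking; values above 1.0 mean the uplinks can congest.
    """
    return (downlinks * down_gbps) / (uplinks * up_gbps)


# Hypothetical fabric: 4 leaf switches, 2 spine switches.
links = leaf_spine_links(num_leaves=4, num_spines=2)
print(f"{len(links)} leaf-spine links")
for leaf, spine in links:
    print(f"  {leaf} <--> {spine}")

# Hypothetical leaf with 16 NIC-facing ports and 16 spine-facing ports.
print(f"oversubscription: {oversubscription(16, 16):.1f}")
```

In a full mesh the number of leaf-spine links is the product of the two switch counts, so any two leaves are always exactly two hops apart through any spine.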

The A100 uses HDR InfiniBand NICs because HDR’s 200 Gbps (i.e., 25 GB/s) unidirectional bandwidth is already close to PCIe Gen4’s theoretical 32 GB/s unidirectional bandwidth. Even higher-end NDR (400 Gbps unidirectional, i.e., 50 GB/s) would offer little additional benefit, because the PCIe Gen4 link between GPU and NIC would become the bottleneck.
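The unit conversion and the PCIe ceiling can be made concrete with a minimal sketch. It uses only the numbers quoted above and assumes, as a simplification, that the usable NIC bandwidth is the smaller of the NIC line rate and the PCIe Gen4 limit. The function names are illustrative.

```python
PCIE_GEN4_GB_PER_S = 32.0  # theoretical unidirectional limit quoted in the text


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def effective_nic_bandwidth(nic_gbps: float,
                            pcie_gb_per_s: float = PCIE_GEN4_GB_PER_S) -> float:
    """Usable NIC bandwidth in GB/s, capped by the PCIe link in front of it.

    Simplifying assumption: the slower of the two paths sets the limit.
    """
    return min(gbps_to_gb_per_s(nic_gbps), pcie_gb_per_s)


for name, rate_gbps in [("HDR", 200), ("NDR", 400)]:
    line_rate = gbps_to_gb_per_s(rate_gbps)
    usable = effective_nic_bandwidth(rate_gbps)
    print(f"{name}: line rate {line_rate:.0f} GB/s, usable {usable:.0f} GB/s")

# How much NDR would gain over HDR once PCIe Gen4 caps it.
gain = effective_nic_bandwidth(400) / effective_nic_bandwidth(200)
print(f"NDR over HDR behind PCIe Gen4: {gain:.2f}x")
```

Under these assumptions, doubling the NIC line rate from HDR to NDR raises usable bandwidth by at most 28%, since the PCIe Gen4 link saturates at 32 GB/s.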

Conclusion:

As a native RDMA network, InfiniBand performs very well in congestion-free, low-latency environments. However, its architecture is relatively closed and its cost is high (at equivalent bandwidth, InfiniBand outperforms RoCE by over 20% but costs twice as much). InfiniBand is therefore suited primarily to small-to-medium-scale clusters.

RoCE, on the other hand, benefits from a mature Ethernet ecosystem, low networking costs, and rapid technological iteration, which makes it more suitable for medium-to-large-scale training clusters. For example, the 8-GPU servers currently sold by public cloud providers almost exclusively use RoCE networks.

