Compute cluster network interconnect: RoCE or InfiniBand?

Data communication for AI servers involves three components: internal server communication, communication between servers within an AI cluster, and wide-area communication across clusters.

High-speed communication between GPUs within a server primarily uses NVLink. NVIDIA also uses NVLink to build SuperPOD clusters, but the number of GPUs it can interconnect is relatively limited, so it is suited mainly to small-scale data transfers between server nodes. Large-scale AI clusters primarily rely on RDMA networks, specifically RoCE or InfiniBand.

This article uses a typical NVIDIA A100 server as an example to detail the interconnect architecture between its various components. The internal network configuration of the A100 server is shown in the figure below:

The main modules of the A100 server include: 2 CPUs, 2 InfiniBand storage network interface cards (BF3 DPUs), 4 PCIe Gen4 switch chips, 6 NVSwitch chips, 8 GPUs (A100), and 8 InfiniBand network interface cards. The 8 GPUs are connected in a full-mesh configuration via the 6 NVSwitch chips.

1. Between GPUs within the host, NVLink is used. The A100's bidirectional bandwidth is 12 × 50 GB/s = 600 GB/s; the A800 is a cut-down version, with bidirectional bandwidth reduced to 8 × 50 GB/s = 400 GB/s.
2. Between GPUs and NICs within the host: GPU <--> PCIe Switch <--> NIC, with a theoretical unidirectional bandwidth of 32 GB/s.
3. Between GPUs across hosts: data is transmitted via InfiniBand NICs, as shown in the figure below:

Whether it is the compute network or the storage network, RDMA is required to meet the high-performance demands of AI. The network adopts a Spine-Leaf architecture: 8 GPUs are directly connected to Leaf switches via InfiniBand NICs (HDR, 200 Gbps), and the Leaf switches are connected to Spine switches via a full-mesh topology, forming a cross-host GPU compute network.
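The leaf-to-spine full mesh described above can be sketched in a few lines of Python. This is only an illustration: the switch counts (4 leaves, 2 spines) and the names build_leaf_spine and leaf_to_leaf_paths are assumptions made for the example, not values taken from the A100 reference design.

```python
from itertools import product


def build_leaf_spine(num_leaves, num_spines):
    """Return the set of (leaf, spine) links in a full-mesh leaf-spine fabric.

    Every leaf switch has one uplink to every spine switch.
    """
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    spines = [f"spine{j}" for j in range(num_spines)]
    return set(product(leaves, spines))


def leaf_to_leaf_paths(links, src_leaf, dst_leaf):
    """List the spines that provide a two-hop path src_leaf -> spine -> dst_leaf."""
    return sorted(
        spine
        for (leaf, spine) in links
        if leaf == src_leaf and (dst_leaf, spine) in links
    )


# Hypothetical fabric size, chosen only to keep the output small.
links = build_leaf_spine(num_leaves=4, num_spines=2)
print(len(links))                                    # number of leaf-spine links
print(leaf_to_leaf_paths(links, "leaf0", "leaf3"))   # spines usable between two leaves
```

In a full mesh, the link count is simply leaves × spines, and any pair of leaves can reach each other through every spine, which is what gives GPUs on different hosts multiple equal-length paths.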

The A100 uses HDR InfiniBand NICs because HDR's 200 Gbps (i.e., 25 GB/s) unidirectional bandwidth is already close to the 32 GB/s theoretical unidirectional bandwidth of PCIe Gen4. Even a higher-end NDR NIC (400 Gbps unidirectional, i.e., 50 GB/s) would offer little additional benefit, because the PCIe Gen4 link between GPU and NIC would remain the bottleneck.
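The bandwidth figures quoted in this article can be checked with a short calculation. The sketch below uses only the theoretical numbers given in the text and ignores encoding and protocol overhead; the function and constant names are illustrative.

```python
def gbps_to_gbytes(gbps):
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return gbps / 8


def nvlink_bidirectional(links, per_link_gbytes=50.0):
    """Aggregate NVLink bandwidth in GB/s: number of links x per-link bandwidth."""
    return links * per_link_gbytes


def usable_nic_bandwidth(nic_gbps, pcie_limit_gbytes):
    """A NIC behind a PCIe link cannot exceed the slower of the two rates."""
    return min(gbps_to_gbytes(nic_gbps), pcie_limit_gbytes)


PCIE_GEN4_GBYTES = 32.0  # theoretical unidirectional figure quoted in the text
NIC_RATES_GBPS = {"HDR": 200, "NDR": 400}

print("A100 NVLink:", nvlink_bidirectional(12), "GB/s")
print("A800 NVLink:", nvlink_bidirectional(8), "GB/s")
for name, rate in NIC_RATES_GBPS.items():
    raw = gbps_to_gbytes(rate)
    usable = usable_nic_bandwidth(rate, PCIE_GEN4_GBYTES)
    print(f"{name}: raw {raw} GB/s, usable behind PCIe Gen4 {usable} GB/s")
```

Under these theoretical figures, an NDR NIC behind PCIe Gen4 would be capped at 32 GB/s, so the step up from HDR adds at most 7 GB/s rather than the 25 GB/s its line rate suggests.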

Conclusion:

As a native RDMA network, InfiniBand performs very well where congestion-free, low-latency communication is required. However, its architecture is relatively closed and its cost is high (at equivalent bandwidth, InfiniBand outperforms RoCE by over 20% but costs twice as much). Therefore, InfiniBand is primarily suitable for small-to-medium-scale clusters.

RoCE, on the other hand, benefits from a mature Ethernet ecosystem, low networking costs, and rapid technological iteration, making it more suitable for medium-to-large-scale training clusters. For example, the 8-GPU servers currently sold by public cloud providers almost exclusively use RoCE networks.
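The cost-performance trade-off quoted above for InfiniBand can be made concrete with a rough calculation. The sketch below normalizes RoCE to 1.0 for both performance and cost and plugs in the article's figures (about 20% more performance at twice the cost); treating "over 20%" as exactly 1.2 is a simplifying assumption, and the function name is illustrative.

```python
def perf_per_cost(relative_perf, relative_cost):
    """Performance delivered per unit of cost, relative to a baseline of 1.0."""
    return relative_perf / relative_cost


roce = perf_per_cost(relative_perf=1.0, relative_cost=1.0)        # baseline
infiniband = perf_per_cost(relative_perf=1.2, relative_cost=2.0)  # article's figures

print(f"RoCE:       {roce:.2f}")
print(f"InfiniBand: {infiniband:.2f}")
```

With these inputs, InfiniBand delivers roughly 0.6 units of performance per unit of cost against 1.0 for RoCE, and this gap matters more as the cluster and its network budget grow.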
