Multi-GPU NVLink Interconnect Configuration Guide: Unlocking Maximum Performance in A100 Clusters
With its powerful computing capabilities and third-generation NVLink high-speed interconnect technology, the NVIDIA A100 GPU has become a benchmark in high-performance computing and AI training. In multi-GPU scenarios, inter-GPU communication bandwidth and latency often become the critical bottleneck limiting overall performance. Fully leveraging NVLink's high-bandwidth, low-latency characteristics to build an efficient GPU communication topology is therefore crucial for unlocking the full potential of A100 clusters. This guide details how to verify, configure, and optimize NVLink-based multi-GPU interconnect environments on the Yuanjie Computing Platform, providing specific commands and step-by-step instructions.
I. Understanding NVLink and Topology
What is NVLink?
NVLink is a high-speed point-to-point interconnect technology developed by NVIDIA, specifically designed to accelerate data transfer between GPUs as well as between GPUs and CPUs/NVSwitch.
Third-generation NVLink (for A100) offers a single-link bandwidth of up to 50 GB/s (bidirectional), significantly higher than traditional PCIe bandwidth.
Its low-latency characteristics are particularly important for distributed training or HPC applications that require frequent data exchange.
A100 NVLink Topology
A single A100 GPU has 12 NVLink links. These links can be used for:
Direct Connect (Peer-to-Peer, P2P): connecting directly to other A100 GPUs within the same node.
Connection to NVSwitch: connecting to NVIDIA NVSwitch chips, which enables the construction of large-scale, full-bandwidth GPU clusters.
In a typical 8-GPU server node, a common topology routes all traffic through NVSwitch to achieve all-to-all interconnection, so that every GPU can communicate with each of the other 7 GPUs at full speed, providing up to 600 GB/s of aggregate NVLink bandwidth per GPU.
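The aggregate figure follows directly from the per-link numbers; a quick arithmetic sanity check (no GPU required):

```python
# A100 (third-generation NVLink): 12 links per GPU,
# each providing 50 GB/s of bidirectional bandwidth.
links_per_gpu = 12
bw_per_link_gbs = 50

aggregate_bw_gbs = links_per_gpu * bw_per_link_gbs
print(f"Aggregate NVLink bandwidth per A100: {aggregate_bw_gbs} GB/s")  # 600 GB/s
```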
II. Verifying Hardware NVLink Connections
Before configuring the software, you must first verify that the physical NVLink connections are properly established.
View GPU Topology Information (Recommended)
Use nvidia-smi's topology view feature:
nvidia-smi topo -m
Interpreting the Output:
Look for the matrix entry between GPUx and GPUy. An NVLink indication (an NV# entry) together with P2P support means there is a valid NVLink P2P connection between the two GPUs. In an NVSwitch system, you should see NVLink entries between all GPU pairs. If the entry is SYS (or a PCIe-level indication), the GPUs are communicating only via the system bus (typically PCIe and QPI/UPI), and the NVLink connection is either disabled or has not been successfully established.
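To check a whole node at once, the matrix printed by nvidia-smi topo -m can also be parsed programmatically. A minimal sketch, assuming a simplified matrix; real output includes extra columns such as CPU Affinity, which this parser does not handle, and the sample string below is hypothetical:

```python
def nvlink_pairs(topo_matrix: str):
    """Return GPU pairs whose topology entry starts with 'NV' (NVLink)."""
    lines = [line.split() for line in topo_matrix.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    pairs = []
    for row in rows:
        src, entries = row[0], row[1:len(header) + 1]
        for dst, entry in zip(header, entries):
            # src < dst keeps each unordered pair only once
            if entry.startswith("NV") and src < dst:
                pairs.append((src, dst, entry))
    return pairs

# Hypothetical 2-GPU excerpt of `nvidia-smi topo -m` output:
sample = """\
GPU0 GPU1
GPU0 X NV12
GPU1 NV12 X
"""
print(nvlink_pairs(sample))  # [('GPU0', 'GPU1', 'NV12')]
```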
Check NVLink Bandwidth (Optional)
You can use the p2pBandwidthLatencyTest tool provided with the CUDA samples to perform a test:
# Assumes the CUDA sample programs are installed, typically under /usr/local/cuda/samples/bin/x86_64/linux/release/
# Otherwise, build samples/1_Utilities/p2pBandwidthLatencyTest yourself
p2pBandwidthLatencyTest
Check the bandwidth between GPU pairs in the output. The bandwidth of the NVLink connection should be significantly higher than that of a PCIe-only connection (for example, close to 50 GB/s or higher).
III. Software Environment Configuration
Ensure that NVLink is correctly recognized and enabled at the software level.
Install the appropriate drivers and CUDA
Use the latest or recommended versions of NVIDIA-certified drivers and the CUDA Toolkit that are compatible with your operating system and hardware.
CUDA 11.4 or later is recommended for optimal support for A100 and NVLink.
Example installation commands (adjust according to your Linux distribution):
# Ubuntu/Debian (example; replace XXX with the actual version numbers)
sudo apt-get install cuda-drivers-XXX   # install the driver package
sudo apt-get install cuda-toolkit-11-8  # install CUDA Toolkit 11.8
# RHEL/CentOS (example; replace XXX with the actual version numbers)
sudo yum install nvidia-driver-XXX      # install the driver package
sudo yum install cuda-11-8              # install CUDA Toolkit 11.8
After installation, run nvidia-smi; it should correctly list all A100 GPUs.
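The listing can also be checked from a script. The sketch below parses the output of nvidia-smi --query-gpu=name --format=csv,noheader; the sample string is hypothetical, and passing it in lets the parsing be exercised on machines without GPUs:

```python
import subprocess

def visible_gpu_names(sample_output=None):
    """Return the GPU names reported by nvidia-smi, one per device."""
    if sample_output is None:
        sample_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            text=True,
        )
    return [line.strip() for line in sample_output.splitlines() if line.strip()]

# Hypothetical output for an 8x A100 node:
sample = "NVIDIA A100-SXM4-80GB\n" * 8
names = visible_gpu_names(sample)
print(len(names), names[0])  # 8 NVIDIA A100-SXM4-80GB
```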
Configure NCCL Environment Variables NCCL (NVIDIA Collective Communications Library) is the core library used by deep learning frameworks (such as PyTorch and TensorFlow) and HPC applications for inter-GPU communication. Correctly configuring NCCL is essential for utilizing NVLink.
Recommendation: Place these environment variables in your job submission script (e.g., a Slurm script) or in the user’s shell configuration file (e.g., .bashrc) to ensure they take effect when compute jobs run.
Enable NVLink transport: Ensure that NCCL prioritizes the NVLink path. This is typically the default behavior, but can be explicitly enforced via environment variables:
export NCCL_PROTO=simple # optional; sometimes useful for debugging, but usually unnecessary. The simple protocol generally performs best over NVLink.
Disable SHM (Shared Memory) Fallback: On certain system configurations, NCCL may incorrectly fall back to using shared memory (SHM) for communication, which bypasses NVLink and results in performance degradation. Force the disabling of SHM:
export NCCL_SHM_DISABLE=1
Specify network interface (if multiple NICs are present): If a node has multiple high-speed network interfaces (such as InfiniBand), ensure that NCCL uses the correct interface for inter-node communication:
export NCCL_IB_HCA=mlx5_0,mlx5_1 # example: specify the InfiniBand HCA devices; adjust to your actual device names (check with `ibdev2netdev`)
Enable debug information (optional, for diagnostics): When debugging connection issues, you can enable NCCL debug output:
export NCCL_DEBUG=INFO
# More verbose level (use with caution; the output volume is large):
# export NCCL_DEBUG=TRACE
With NCCL_DEBUG=INFO set, the topology detected by NCCL is printed in the logs during program execution, clearly indicating whether communication between GPUs uses NVLink or other paths (such as PCIe, PXB, NVB). Seeing output similar to [0] -> [1] via P2P/NVLink is a good sign.
Set communication algorithms (advanced): For specific topologies, you can try different algorithms:
export NCCL_ALGO=ring # or tree
Typically, the tree algorithm performs better on fully connected NVSwitch topologies, while ring may be more suitable for non-fully-connected or cross-node scenarios. NCCL normally selects the algorithm automatically based on the topology, but you can specify one manually for comparison testing.
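If you cannot edit the job script, the same variables can be set from Python before the framework creates its first NCCL communicator. A minimal sketch mirroring the recommendations above; the HCA names are examples and must match your hardware:

```python
import os

# Must run before torch.distributed.init_process_group (or the
# TensorFlow equivalent), since NCCL reads these at communicator setup.
nccl_env = {
    "NCCL_SHM_DISABLE": "1",         # avoid the shared-memory fallback
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",  # example device names; check with ibdev2netdev
    "NCCL_DEBUG": "INFO",            # log topology/transport selection
}
for key, value in nccl_env.items():
    os.environ.setdefault(key, value)  # don't override scheduler-provided values
```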
IV. Verifying NVLink Usage in Applications
Performing Health Checks with dcgmi diag
NVIDIA's Data Center GPU Manager (DCGM) includes a diagnostic tool, dcgmi diag, that can run a series of tests, including NVLink bandwidth tests.
# Install DCGM (you may need to add NVIDIA's repository)
# Ubuntu/Debian
sudo apt-get install datacenter-gpu-manager
# RHEL/CentOS
sudo yum install datacenter-gpu-manager
# Run the diagnostics (may require root privileges)
sudo dcgmi diag -r 1   # run the Level 1 (quick) diagnostics
# Or run more comprehensive tests:
# sudo dcgmi diag -r 2  # Level 2 (medium) diagnostics
# sudo dcgmi diag -r 3  # Level 3 (full) diagnostics
Review the test results to ensure the NVLink-related test items (such as the NvLink bandwidth test) have passed, and that the reported bandwidth meets expectations.
Performing Performance Tests with nccl-tests
NVIDIA provides an official set of benchmark programs, nccl-tests. This is the most direct tool for verifying multi-GPU communication bandwidth and latency.
# Install dependencies (usually required)
sudo apt-get install build-essential libopenmpi-dev   # Ubuntu/Debian
sudo yum groupinstall "Development Tools" && sudo yum install openmpi-devel   # RHEL/CentOS
# Download and build nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 NCCL_HOME=/path/to/your/nccl/installation  # specify NCCL_HOME if NCCL is not in a standard path
# If MPI is not installed, you can build the non-MPI version (the MPI build is more commonly used):
# make
Run the test.
Single-node testing (using NVLink):
# Assuming Open MPI; test allreduce bandwidth across 8 GPUs
mpirun -np 8 --host localhost -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2
-np 8: run 8 processes (one per GPU).
--host localhost: run on the local node.
-x NCCL_DEBUG=INFO: pass the environment variable so that NCCL topology information is printed.
-b 8 -e 128M -f 2: test message sizes from 8 bytes to 128 MB, doubling each step.
Examine the Avg bus bandwidth in the output. On an 8-GPU A100 + NVSwitch node, when NVLink is used, the allreduce bus bandwidth for 128 MB messages should typically reach ~350 GB/s or higher (the theoretical aggregate bandwidth is 600 GB/s, though actual performance is influenced by algorithms and other factors). If the bandwidth is significantly lower than this (e.g., only tens of GB/s), NVLink is likely not being fully utilized. Also examine the NCCL_DEBUG=INFO output to confirm that inter-GPU communication uses NVLink rather than PCIe or other methods.
Multi-node testing (using the network):
# Example: run 4 processes on each of two nodes (node1, node2)
mpirun -np 8 --host node1:4,node2:4 -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_SHM_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2
This test evaluates cross-node communication bandwidth, which depends on the performance of the RDMA network (such as InfiniBand). Communication among the four cards within a single node should still utilize NVLink to achieve high bandwidth.
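The Avg bus bandwidth figure is not the raw wire rate: nccl-tests derives it from the measured algorithm bandwidth using a per-collective correction factor, which for allreduce is 2*(n-1)/n. A small sketch of the conversion, handy for sanity-checking reported numbers (the 0.6 ms timing below is purely illustrative):

```python
def allreduce_bus_bw_gbs(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s, following the nccl-tests convention:
    busBw = algBw * 2 * (n - 1) / n for allreduce."""
    alg_bw = size_bytes / time_s / 1e9  # algorithm bandwidth, GB/s
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# Illustrative: a 128 MB allreduce across 8 GPUs completing in 0.6 ms
print(round(allreduce_bus_bw_gbs(128e6, 0.6e-3, 8), 1))  # 373.3
```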
V. Utilizing NVLink in Deep Learning Frameworks
Mainstream deep learning frameworks such as PyTorch and TensorFlow typically rely on NCCL for communication when initiating distributed training. Therefore, as long as your environment has correctly configured NCCL (to recognize and use NVLink) according to the steps described above, the framework will automatically benefit from the high bandwidth and low latency advantages of NVLink when performing multi-GPU communication within a single node.
PyTorch: Use torch.distributed.launch or torchrun to start distributed training; no special code modifications are required. The framework will invoke the NCCL backend (backend='nccl').
TensorFlow: When using tf.distribute.MirroredStrategy (single-machine, multi-GPU) or MultiWorkerMirroredStrategy (multi-node, multi-GPU), the underlying implementation also uses NCCL. TensorFlow likewise honors the NCCL-related environment variables.
VI. Frequently Asked Questions and Precautions
Physical Connection Errors: Ensure that the NVLink bridge boards between GPUs inside the server or the cables connecting to the NVSwitch are installed correctly and securely. Poor physical connections are the most common cause of NVLink failure to enable.
PCIe Slot Restrictions: The server’s PCIe topology design affects GPU interconnectivity. Ensure GPUs are installed in slots that support direct NVLink interconnects or are connected to the same NVSwitch. Refer to the topology diagram in the server manual.
MIG (Multi-Instance GPU) Mode: If MIG mode is enabled on an A100 GPU, splitting the GPU into instances may affect NVLink connectivity or restrict it to the boundaries of the MIG instances. It is generally recommended to disable MIG in scenarios requiring full-bandwidth interconnectivity.
Environment Variable Conflicts: Ensure that no other environment variables override NCCL_SHM_DISABLE=1 or other critical settings.
Software Version Compatibility: Always keep key components such as the driver, CUDA, and NCCL mutually compatible. Use NVIDIA-certified configuration combinations.
NUMA Affinity: For optimal performance, consider binding processes to specific CPU cores and GPUs to minimize the impact of NUMA effects. Tools such as numactl or taskset can be used for this purpose, and MPI launchers (such as mpirun and srun) typically provide binding options as well.
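On Linux, CPU affinity can also be pinned from inside the process via the standard library, as an alternative to wrapping the launch command with numactl or taskset. A hedged sketch; the rank-to-core mapping is a placeholder and should be derived from your node's actual NUMA layout (e.g., the CPU Affinity column of nvidia-smi topo -m):

```python
import os

def bind_rank_to_cores(rank: int, cores_per_rank: int = 8) -> set:
    """Pin the current process to a contiguous block of cores (Linux only).
    The mapping rank -> [rank * cores_per_rank, ...) is illustrative."""
    first = rank * cores_per_rank
    wanted = set(range(first, first + cores_per_rank))
    allowed = os.sched_getaffinity(0)  # cores this process is permitted to use
    # Fall back to the full allowed set if the wanted cores are unavailable
    os.sched_setaffinity(0, wanted & allowed or allowed)
    return os.sched_getaffinity(0)

print(bind_rank_to_cores(0, cores_per_rank=1))
```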
Leveraging the NVIDIA A100's NVLink high-speed interconnect technology is a key step in unlocking the maximum performance of your GPU cluster. By carefully verifying physical connections, correctly configuring the software environment (particularly the NCCL-related settings), and using the tools above for performance testing and monitoring, you can ensure your distributed applications run efficiently on the powerful A100 clusters provided by Yuanjie Computing, achieving optimal training speed and computational efficiency. If you encounter any issues during configuration, please contact Yuanjie Computing's technical support team.
Please note:
Specific command paths (such as CUDA example paths and NCCL installation paths) may vary depending on system configuration and installation methods; please adjust them according to your actual situation.
The location for setting environment variables (.bashrc, Slurm scripts, etc.) depends on your job scheduling system and environment management approach.
Compiling nccl-tests may require adjusting the Makefile.
The bandwidth values obtained from testing are affected by the specific hardware configuration (CPU, memory, network), system load, and test parameters; the values given in this document are reference values for typical scenarios.