A100 NVLink configuration optimization full guide

Published November 28, 2025


Multi-GPU NVLink Interconnect Configuration Guide: Unlocking Maximum Performance in A100 Clusters

With its powerful compute and third-generation NVLink high-speed interconnect, the NVIDIA A100 GPU has become a benchmark in high-performance computing and AI training. In multi-GPU scenarios, inter-GPU communication bandwidth and latency often become the critical bottleneck limiting overall performance, so fully exploiting NVLink's high-bandwidth, low-latency characteristics to build an efficient GPU communication topology is crucial for unlocking the full potential of A100 clusters. This guide details how to verify, configure, and optimize an NVLink-based multi-GPU interconnect environment on the Yuanjie Computing Platform, with specific commands and step-by-step instructions.

I. Understanding NVLink and Topology

  1. What is NVLink?

  • NVLink is a high-speed point-to-point interconnect technology developed by NVIDIA, specifically designed to accelerate data transfer between GPUs as well as between GPUs and CPUs/NVSwitch.

  • Third-generation NVLink (for A100) offers a single-link bandwidth of up to 50 GB/s (bidirectional), significantly higher than traditional PCIe bandwidth.

  • Its low-latency characteristics are particularly important for distributed training or HPC applications that require frequent data exchange.

  2. A100 NVLink Topology

    • A single A100 GPU has 12 NVLink channels. These channels can be used for:

      • Direct Connect (Peer-to-Peer, P2P): direct links to other A100 GPUs within the same node.

      • Connection to NVSwitch: links to NVIDIA NVSwitch chips, which enable the construction of large-scale, full-bandwidth interconnected GPU clusters.

    • In a typical 8-GPU server node, a common topology achieves all-to-all interconnection of all 8 GPUs via NVSwitch, where each GPU has a direct NVLink path to the other 7 GPUs, providing up to 600 GB/s of aggregate bandwidth per GPU.
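The 600 GB/s aggregate figure follows directly from the link count and per-link rate; a trivial sanity check:

```shell
# Third-generation NVLink on A100: 12 links per GPU, 50 GB/s per link (bidirectional)
LINKS=12
PER_LINK_GBPS=50
echo "Aggregate per-GPU NVLink bandwidth: $((LINKS * PER_LINK_GBPS)) GB/s"
```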

II. Verifying Hardware NVLink Connections

Before configuring the software, you must first verify that the physical NVLink connections are properly established.

    1. View GPU Topology Information (Recommended): use the topology view feature of nvidia-smi:

      nvidia-smi topo -m
    2. Interpreting the Output:

    • Find the matrix cell for each GPU pair (row GPUx, column GPUy).

    • If the cell shows NV# (where # is the number of NVLink links traversed, e.g. NV1 or NV2), there is a valid NVLink P2P connection between the two GPUs.

    • In an NVSwitch-based system, every GPU pair typically shows NV12, since all 12 links are routed through the switch fabric.

    • If the cell shows SYS (or another PCIe-only indication such as PHB or NODE), the GPUs are communicating only via the system bus (typically PCIe and QPI/UPI), and the NVLink connection is either disabled or has not been successfully established.
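When auditing many nodes it can help to count fallback pairs automatically. A small grep-based sketch; the two-GPU matrix below is a hand-written illustration of the matrix shape, not output captured from a real system:

```shell
# Count GPU rows in `nvidia-smi topo -m` output that contain a SYS entry,
# i.e. pairs falling back to the system bus instead of NVLink.
# Live usage: nvidia-smi topo -m | count_sys_rows
count_sys_rows() {
  grep '^GPU' | grep -c 'SYS' || true
}

# Hand-written illustrative 2-GPU matrix (NOT real nvidia-smi output):
sample='        GPU0    GPU1
GPU0     X      NV12
GPU1    NV12     X'
printf '%s\n' "$sample" | count_sys_rows   # a fully NVLinked pair reports 0
```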

    3. Check NVLink Bandwidth (Optional): you can use the p2pBandwidthLatencyTest tool provided with the CUDA samples to run a test:

      # Assuming the CUDA sample programs are installed, typically under
      # /usr/local/cuda/samples/bin/x86_64/linux/release/
      # Otherwise, build samples/1_Utilities/p2pBandwidthLatencyTest yourself
      p2pBandwidthLatencyTest
  • Check the bandwidth between GPU pairs in the output. The bandwidth of the NVLink connection should be significantly higher than that of a PCIe-only connection (for example, close to 50 GB/s or higher).

III. Software Environment Configuration

    Ensure that NVLink is correctly recognized and enabled at the software level.

    1. Install the appropriate drivers and CUDA

    • Use the latest or recommended versions of NVIDIA-certified drivers and the CUDA Toolkit that are compatible with your operating system and hardware.

    • CUDA 11.4 or later is recommended for optimal support for A100 and NVLink.

    • Example installation commands (adjust according to your Linux distribution):

      # Ubuntu/Debian (example; substitute the actual version numbers)
      sudo apt-get install cuda-drivers-XXX # install the driver package
      sudo apt-get install cuda-toolkit-11-8 # install CUDA Toolkit 11.8


      # RHEL/CentOS (example; substitute the actual version numbers)
      sudo yum install nvidia-driver-XXX # install the driver package
      sudo yum install cuda-11-8 # install CUDA Toolkit 11.8


    • After installation, run nvidia-smi; it should correctly list all A100 GPUs.

  2. Configure NCCL Environment Variables. NCCL (NVIDIA Collective Communications Library) is the core library used by deep learning frameworks (such as PyTorch and TensorFlow) and HPC applications for inter-GPU communication. Correctly configuring NCCL is essential for utilizing NVLink.

    Recommendation: Place these environment variables in your job submission script (e.g., a Slurm script) or in the user’s shell configuration file (e.g., .bashrc) to ensure they take effect when compute jobs run.

    • Enable NVLink transport: Ensure that NCCL prioritizes the NVLink path. This is typically the default behavior, but can be explicitly enforced via environment variables:

      export NCCL_PROTO=simple # optional; sometimes useful for debugging, but usually unnecessary. The simple protocol typically performs best over NVLink.
    • Disable SHM (Shared Memory) Fallback: On certain system configurations, NCCL may incorrectly fall back to using shared memory (SHM) for communication, which bypasses NVLink and results in performance degradation. Force the disabling of SHM:

      export NCCL_SHM_DISABLE=1
    • Specify network interface (if multiple NICs are present): If a node has multiple high-speed network interfaces (such as InfiniBand), ensure that NCCL uses the correct interface for inter-node communication:

      export NCCL_IB_HCA=mlx5_0,mlx5_1 # example: specify the InfiniBand HCA devices; adjust to your actual device names (check with `ibdev2netdev`)
    • Enable debug information (optional, for diagnostics): When debugging connection issues, you can enable NCCL debug output:

      export NCCL_DEBUG=INFO
      # more verbose level (use with caution; the output is very large)
      # export NCCL_DEBUG=TRACE
    • With NCCL_DEBUG=INFO set, the topology detected by NCCL is printed in the logs during program execution, clearly indicating whether communication between GPUs uses NVLink or another path (such as PCIe, PXB, or NVB). Output similar to [0] -> [1] via P2P/NVLink is a good sign.

    • Set communication algorithms (advanced): For specific topologies, you can try different algorithms:

      export NCCL_ALGO=ring # or tree
    • Typically, the ring algorithm performs well on fully connected NVSwitch topologies, while tree may be more suitable for non-fully-connected or cross-node scenarios. NCCL normally selects an algorithm automatically based on the topology, but you can specify one manually for comparison testing.
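Pulling the settings above together, a minimal single-node Slurm script sketch; the job name, GPU count, and train.py are placeholders, and the InfiniBand line applies only to multi-node jobs:

```shell
#!/bin/bash
#SBATCH --job-name=nccl-train   # placeholder job name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8            # request all 8 A100s on the node

# NCCL settings discussed in this section
export NCCL_DEBUG=INFO          # log the topology NCCL detects
export NCCL_SHM_DISABLE=1       # avoid the shared-memory fallback
# export NCCL_IB_HCA=mlx5_0,mlx5_1   # multi-node only; adjust device names

srun python train.py            # placeholder training entry point
```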

IV. Verifying NVLink Usage in Applications

    1. Performing Health Checks with dcgmi diag. NVIDIA's Data Center GPU Manager (DCGM) includes a diagnostic tool, dcgmi diag, that can run a series of tests, including NVLink bandwidth tests.

      # Install DCGM (may require adding NVIDIA's repository)
      # Ubuntu/Debian
      sudo apt-get install datacenter-gpu-manager
      # RHEL/CentOS
      sudo yum install datacenter-gpu-manager
      
      # Run diagnostics (may require root privileges)
      sudo dcgmi diag -r 1 # run Level 1 (quick) diagnostics
      # or more thorough tests
      # sudo dcgmi diag -r 2 # run Level 2 (medium) diagnostics
      # sudo dcgmi diag -r 3 # run Level 3 (comprehensive) diagnostics
    2. Review the test results to ensure that the NVLink-related test items (such as the NvLink Bandwidth Test) have passed, and that the reported bandwidth meets expectations.

    3. Performing Performance Tests with nccl-tests. NVIDIA provides an official benchmark suite, nccl-tests; this is the most direct tool for verifying multi-GPU communication bandwidth and latency.

      # Install dependencies (usually required)
      sudo apt-get install build-essential libopenmpi-dev # Ubuntu/Debian
      sudo yum groupinstall "Development Tools" && sudo yum install openmpi-devel # RHEL/CentOS
      
      # Download nccl-tests
      git clone https://github.com/NVIDIA/nccl-tests.git
      cd nccl-tests
      make MPI=1 NCCL_HOME=/path/to/your/nccl/installation # specify if NCCL is not in a standard path
      # If MPI is not installed, you can build the non-MPI version (the MPI build is more common)
      # make
    4. Run the test:

    • -np 8: Specify 8 processes (corresponding to 8 GPUs).

    • --host localhost: run on the local node.

    • -x NCCL_DEBUG=INFO: Pass environment variables to output NCCL topology information.

    • -b 8 -e 128M -f 2: Test data sizes ranging from 8 bytes to 128 MB, with a step size factor of 2.

    • Examine the output Avg bus bandwidth. On an 8-GPU node with A100 + NVSwitch, when using NVLink, the allreduce bandwidth for a 128 MB data set should typically reach ~350 GB/s or higher (theoretical aggregate bandwidth is 600 GB/s, though actual performance is influenced by algorithms and other factors). If the bandwidth is significantly lower than this (e.g., only tens of GB/s), it is likely that NVLink is not being fully utilized.

    • Also examine the NCCL_DEBUG=INFO output to confirm that inter-GPU communication is using NVLink rather than PCIe or another path.

    • Single-node testing (using NVLink):

      # Assuming Open MPI; test allreduce bandwidth across 8 GPUs
      mpirun -np 8 --host localhost -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2
    • Multi-node testing (using a network):

      # Example: run 4 processes on each of two nodes (node1, node2)
      mpirun -np 8 --host node1:4,node2:4 -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_SHM_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2
    • This test evaluates cross-node communication bandwidth, which depends on the performance of the RDMA network (such as InfiniBand). Communication among the four cards within a single node should still utilize NVLink to achieve high bandwidth.
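As a reading aid for the numbers above: nccl-tests reports "bus bandwidth" derived from the measured algorithm bandwidth, and for allreduce the conversion factor is 2*(n-1)/n (documented in the project's PERFORMANCE.md). This is how ~350 GB/s of reported bus bandwidth relates to the 600 GB/s hardware figure on 8 GPUs. A quick check with a hypothetical algBw value:

```shell
# Convert allreduce algorithm bandwidth (algBw) to the bus bandwidth
# that nccl-tests reports, using the factor 2*(n-1)/n.
n=8        # number of GPUs
algbw=200  # hypothetical measured algBw in GB/s
awk -v n="$n" -v a="$algbw" 'BEGIN { printf "busBw = %.1f GB/s\n", a * 2 * (n - 1) / n }'
```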

V. Utilizing NVLink in Deep Learning Frameworks

    Mainstream deep learning frameworks such as PyTorch and TensorFlow typically rely on NCCL for communication when initiating distributed training. Therefore, as long as your environment has correctly configured NCCL (to recognize and use NVLink) according to the steps described above, the framework will automatically benefit from the high bandwidth and low latency advantages of NVLink when performing multi-GPU communication within a single node.

    • PyTorch: use torch.distributed.launch or torchrun to start distributed training; no special code changes are required. The framework will use the NCCL backend (backend='nccl').

    • TensorFlow: when using tf.distribute.MirroredStrategy (single-machine, multi-GPU) or MultiWorkerMirroredStrategy (multi-node, multi-GPU), the underlying implementation also uses NCCL, and TensorFlow likewise honors NCCL-related environment variables.
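For reference, hypothetical single-node launch commands combining the settings above; train.py and the GPU count are placeholders:

```shell
# PyTorch: torchrun starts one process per GPU; the script is expected to call
# torch.distributed.init_process_group(backend='nccl') internally.
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 train.py

# TensorFlow with tf.distribute.MirroredStrategy runs as a single process;
# NCCL environment variables still apply to its internal all-reduce.
NCCL_DEBUG=INFO python train.py
```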

VI. Frequently Asked Questions and Precautions

    1. Physical Connection Errors: Ensure that the NVLink bridge boards between GPUs inside the server, or the cables connecting to the NVSwitch, are installed correctly and securely. A poor physical connection is the most common reason NVLink fails to come up.

    2. PCIe Slot Restrictions: The server’s PCIe topology design affects GPU interconnectivity. Ensure GPUs are installed in slots that support direct NVLink interconnects or are connected to the same NVSwitch. Refer to the topology diagram in the server manual.

    3. MIG (Multi-Instance GPU) Mode: If MIG mode is enabled on an A100 GPU, splitting the GPU into instances may affect NVLink connectivity or restrict it to the boundaries of the MIG instances. It is generally recommended to disable MIG in scenarios requiring full-bandwidth interconnectivity.

    4. Environment Variable Conflicts: Ensure that no other environment variables override the NCCL_SHM_DISABLE=1 and other critical configurations.

    5. Software Version Compatibility: Always maintain compatibility among key components such as drivers, CUDA, and NCCL. Use NVIDIA-certified configuration combinations.

    6. NUMA Affinity: For optimal performance, consider binding processes to specific CPU cores and GPUs to minimize the impact of NUMA effects. Tools such as numactl or taskset can be used for this purpose, and MPI launchers (such as mpirun, srun) typically provide binding options as well.
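As one sketch of such binding with numactl (NUMA node 0 is chosen arbitrarily here; check the NUMA Affinity column of `nvidia-smi topo -m` for the real GPU-to-node mapping, and train.py is a placeholder):

```shell
# Pin the training process's CPU scheduling and memory allocation to NUMA node 0,
# which should be the node local to the GPUs this process will drive.
numactl --cpunodebind=0 --membind=0 python train.py
```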


    Leveraging the NVIDIA A100's NVLink high-speed interconnect is a key step in unlocking the maximum performance of your GPU cluster. By carefully verifying physical connections, correctly configuring the software environment (particularly the NCCL-related settings), and using the tools above for performance testing and monitoring, you can ensure your distributed applications run efficiently on the powerful A100 clusters provided by Yuanjie Computing, achieving optimal training speed and computational efficiency. If you encounter any issues during configuration, please contact Yuanjie Computing's technical support team.


    Please note:

    • Specific command paths (such as CUDA example paths and NCCL installation paths) may vary depending on system configuration and installation methods; please adjust them according to your actual situation.

    • The location for setting environment variables (.bashrc, Slurm scripts, etc.) depends on your job scheduling system and environment management approach.

    • Compiling nccl-tests may require adjusting its Makefile.

    • The bandwidth values obtained from testing are affected by specific hardware configurations (CPU, memory, network), system load, and test parameters; the values provided in this document are reference values for typical scenarios.

