Troubleshooting Guide for Common Faults on Multi-GPU Servers under Ubuntu

Published November 5, 2025


1. Basic Status Check

Objective: Verify whether the GPU is recognized by the system

# Show information for all GPUs (NVIDIA)
nvidia-smi

# List PCI devices (generic)
lspci | grep -i nvidia

# Check that the kernel modules are loaded
lsmod | grep nvidia


Symptoms:

  • No output from lsmod → Driver not installed or kernel module not loaded

  • nvidia-smi reports that no devices were found → Hardware connection issue

  • Abnormal temperature or power draw in nvidia-smi → Possible cooling failure
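
The three checks and the symptoms above can be combined into one triage step. The following is a minimal sketch rather than part of the original procedure: the function name gpu_triage and its three arguments are assumptions made for illustration, and the messages only suggest where to look next.

```shell
#!/usr/bin/env bash
# gpu_triage is a hypothetical helper (not an NVIDIA tool).
# Arguments: number of NVIDIA lines from lspci, number of nvidia lines
# from lsmod, and the exit status of nvidia-smi.
gpu_triage() {
    local pci_count=$1 module_count=$2 smi_status=$3
    if [ "$pci_count" -eq 0 ]; then
        echo "no NVIDIA device on the PCI bus: check seating, risers, and power cables"
    elif [ "$module_count" -eq 0 ]; then
        echo "device visible but kernel module not loaded: check the driver installation"
    elif [ "$smi_status" -ne 0 ]; then
        echo "module loaded but nvidia-smi failed: check dmesg for driver errors"
    else
        echo "GPU recognized by the system"
    fi
}

# Usage on a live system (uncomment to run):
# gpu_triage "$(lspci | grep -ci nvidia)" "$(lsmod | grep -c '^nvidia')" \
#            "$(nvidia-smi >/dev/null 2>&1; echo $?)"
```

Checking the PCI bus first separates hardware faults from driver faults: a card that is missing from lspci cannot be fixed by reinstalling the driver.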


2. Troubleshooting Drivers and the CUDA Environment

Objective: Verify driver compatibility

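# Check the driver version
cat /proc/driver/nvidia/version

# Verify the CUDA toolkit
nvcc --version

# Test basic CUDA functionality
# (the samples ship with the toolkit only up to CUDA 11.5; for newer
# releases, build deviceQuery from github.com/NVIDIA/cuda-samples)
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery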


Common Issues:

  • Version conflicts: check the installed driver packages with sudo apt list --installed | grep nvidia

  • CUDA errors: reinstall the sample programs with cuda-install-samples-.sh
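
A version conflict usually means the installed driver is older than the CUDA toolkit needs. The sketch below shows one way to compare the two. The helper names extract_version and version_ge are hypothetical, REQUIRED_DRIVER is a placeholder you must fill in from the release notes of your CUDA version, and the script assumes the driver version is the first dotted number in /proc/driver/nvidia/version.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing driver versions; names are illustrative.

# Print the first dotted version number found on stdin
# (assumption: the driver version is the first such number in the text).
extract_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort), which is available on Ubuntu.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a live system (uncomment and set REQUIRED_DRIVER yourself):
# REQUIRED_DRIVER=...
# installed=$(extract_version < /proc/driver/nvidia/version)
# version_ge "$installed" "$REQUIRED_DRIVER" || echo "driver $installed is too old"
```

Version sort is used instead of a plain string comparison because driver versions such as 535.54 and 535.129 do not order correctly as text.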


3. Troubleshooting Multi-GPU Communication

Objective: Detect the status of inter-GPU communication (NCCL/P2P)

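# Test NCCL communication
# (-g sets the number of GPUs per process and defaults to 1; add it to
# exercise every card under test)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2

# Check the GPU topology and P2P access paths
nvidia-smi topo -m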


Key Metrics:

  • P2P status: in the nvidia-smi P2P matrix (for example, from nvidia-smi topo -p2p r), every GPU pair should read OK

  • NCCL errors: the log contains "transport retry" → Network configuration issue
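
To look for the NCCL keyword above, it helps to save a verbose log and scan it afterwards. This is a minimal sketch: scan_nccl_log is a hypothetical helper, nccl.log is an arbitrary file name, and the only NCCL feature it relies on is the NCCL_DEBUG=INFO environment variable.

```shell
#!/usr/bin/env bash
# scan_nccl_log is a hypothetical helper: it counts the lines in a saved
# NCCL log that contain the keyword "transport retry" (case-insensitive).
# It returns 1 when the keyword is present and 0 otherwise.
scan_nccl_log() {
    local logfile=$1
    local hits
    hits=$(grep -ci 'transport retry' "$logfile" || true)
    if [ "${hits:-0}" -gt 0 ]; then
        echo "$hits line(s) mention transport retry: review the network configuration"
        return 1
    fi
    echo "no transport retry messages found"
}

# Usage: capture a verbose log first, then scan it.
# NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 256M -f 2 2>&1 | tee nccl.log
# scan_nccl_log nccl.log
```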


4. Handling Resource Allocation Exceptions

Objective: Resolve VRAM/process conflicts

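# List the processes using each GPU
nvidia-smi --query-compute-apps=pid,used_memory --format=csv

# Force-release GPU memory (use with caution: this kills every compute
# process on every GPU)
sudo kill -9 $(nvidia-smi --query-compute-apps=pid --format=csv,noheader)

# Restrict GPU visibility (isolate a single card for debugging)
CUDA_VISIBLE_DEVICES=0 python test_script.py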


Typical scenarios:

  • CUDA_ERROR_OUT_OF_MEMORY → Zombie processes consuming GPU memory

  • Uneven load across multiple cards → Check task allocation logic
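
Before resorting to kill -9 on every compute process, it is usually safer to narrow the list to the processes that actually hold a lot of memory and inspect them first. The sketch below assumes the CSV output of nvidia-smi with the noheader and nounits options, where memory is reported in MiB. The helper name pids_over_threshold and the 1024 MiB threshold are illustrative choices, not values from this guide.

```shell
#!/usr/bin/env bash
# pids_over_threshold is a hypothetical helper. It reads "pid, used_memory"
# rows (CSV, no header, no units, memory in MiB) on stdin and prints the
# PIDs whose GPU memory use exceeds the threshold given as $1.
pids_over_threshold() {
    local threshold_mib=$1
    awk -F', *' -v t="$threshold_mib" '$2 + 0 > t { print $1 }'
}

# Usage on a live system: show owner, runtime, and command line for each
# heavy process so it can be reviewed before anything is killed.
# nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits \
#     | pids_over_threshold 1024 \
#     | while read -r pid; do ps -o pid,user,etime,cmd -p "$pid"; done
```

Once a process is confirmed to be stale, a plain kill (SIGTERM) gives it a chance to release its resources cleanly; kill -9 remains the fallback.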


5. In-Depth Hardware Diagnostics

Objective: Identify physical hardware failures

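# Stress test (requires cuda-samples)
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest

# Continuous monitoring tools
nvidia-smi dmon  # Continuously refreshed metrics
nvtop          # Interactive visual monitor (must be installed separately)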


Hardware failure characteristics:

  • Bandwidth test failure → PCIe slot or cable failure

  • ECC error count keeps increasing → GPU memory corruption
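
Whether an ECC count is increasing can only be judged by comparing two readings taken some time apart. The following sketch compares two saved snapshots. The helper name ecc_increased, the snapshot file names, and the one-hour interval are assumptions; the query field ecc.errors.uncorrected.volatile.total is taken from the list printed by nvidia-smi --help-query-gpu and only returns numbers on GPUs with ECC enabled.

```shell
#!/usr/bin/env bash
# ecc_increased is a hypothetical helper. Each argument is a file of
# "gpu_index, error_count" rows (CSV, no header). It prints one line for
# every GPU whose count is higher in the second snapshot than in the first.
ecc_increased() {
    local before=$1 after=$2
    awk -F', *' '
        NR == FNR { base[$1] = $2 + 0; next }     # first file: remember counts
        ($1 in base) && ($2 + 0 > base[$1]) {     # second file: compare
            printf "GPU %s: %d -> %d\n", $1, base[$1], $2
        }
    ' "$before" "$after"
}

# Usage on a live system (uncomment to run):
# nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
#     --format=csv,noheader > ecc_before.csv
# sleep 3600
# nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
#     --format=csv,noheader > ecc_after.csv
# ecc_increased ecc_before.csv ecc_after.csv
```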


6. Log Analysis and System-Level Troubleshooting

Objective: Analyze low-level logs

# View the kernel log (filtered for GPU errors)
dmesg | grep -i nvidia

# NVIDIA driver installer log
cat /var/log/nvidia-installer.log

# System service status (applies to A100/H100)
systemctl status nvidia-persistenced


Log Keywords:

  • GPU has fallen off the bus → Insufficient power

  • Failed to initialize NVML → Driver loading failed
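
The keyword lookup above is easy to automate. This is a minimal sketch in which classify_gpu_log is a hypothetical helper that knows only the two keywords listed in this section; the causes it prints are the ones given above and should be read as starting points.

```shell
#!/usr/bin/env bash
# classify_gpu_log is a hypothetical helper. It reads log text on stdin and
# prints the suspected cause next to every line that contains one of the
# two keywords from this section. Other lines are ignored.
classify_gpu_log() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *"fallen off the bus"*)
                echo "insufficient power suspected: $line" ;;
            *"Failed to initialize NVML"*)
                echo "driver failed to load: $line" ;;
        esac
    done
}

# Usage on a live system: the first keyword appears in the kernel log,
# the second in the output of nvidia-smi itself.
# { sudo dmesg; nvidia-smi 2>&1; } | classify_gpu_log
```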

Important Notes:

  • For multi-card servers, ensure that "Above 4G Decoding" is enabled in the BIOS

  • Run sudo update-pciids to update the PCI ID database

  • Recommended: run dcgmi diag -r 3 for a comprehensive diagnostic (requires DCGM to be installed)

