Troubleshooting Guide for Multi-GPU Servers on Ubuntu

1. Basic Status Check

Objective: Verify whether the GPU is recognized by the system

# Show information for all GPUs (NVIDIA)
nvidia-smi

# List PCI devices (generic)
lspci | grep -i nvidia

# Check that the kernel modules are loaded
lsmod | grep nvidia

Symptoms:

No output from lsmod, or nvidia-smi not found → Driver not installed or not loaded
nvidia-smi reports "No devices were found" → Hardware connection issue
Abnormal temperature or power consumption → Cooling failure
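
The three checks and the symptom list above can be combined into one small decision function. This is a minimal sketch, not a definitive tool: the names classify_gpu_status and gpu_basic_check are hypothetical, and it assumes that GPUs appear in lspci as "VGA compatible controller" or "3D controller" entries.

```shell
#!/bin/sh
# Sketch: map the three basic checks to a likely cause.
# classify_gpu_status and gpu_basic_check are hypothetical helper names.

# $1 = NVIDIA GPUs on the PCI bus
# $2 = loaded nvidia kernel modules
# $3 = GPUs listed by nvidia-smi
classify_gpu_status() {
    if [ "$1" -eq 0 ]; then
        echo "no-hardware"        # nothing on the bus: check seating, risers, power
    elif [ "$2" -eq 0 ]; then
        echo "driver-not-loaded"  # card present, kernel module missing
    elif [ "$3" -lt "$1" ]; then
        echo "gpu-missing"        # driver loaded, but some cards are not usable
    else
        echo "ok"
    fi
}

# Gather the three counts from the live system and classify them.
gpu_basic_check() {
    pci=$(lspci | grep -i nvidia | grep -Eic 'vga compatible|3d controller')
    mods=$(lsmod | grep -c '^nvidia')
    smi=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ')
    classify_gpu_status "$pci" "$mods" "$smi"
}
```

Calling gpu_basic_check on the server prints one of the four labels. Keeping the classification separate from the data gathering makes the decision logic easy to check with known counts.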

2. Troubleshooting Drivers and the CUDA Environment

Objective: Verify driver compatibility

# Check the driver version
cat /proc/driver/nvidia/version

# Verify the CUDA toolkit
nvcc --version

# Test basic CUDA functionality (the samples must be built first)
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

Common Issues:

Version conflicts: Use sudo apt list --installed | grep nvidia to check the installed driver version
CUDA errors: Run cuda-install-samples-.sh to reinstall the test samples
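
A frequent source of version conflicts is a toolkit that is newer than the driver supports. The sketch below compares the "CUDA Version" shown in the nvidia-smi banner with the release reported by nvcc. Treat it as a rule-of-thumb check only: all function names are hypothetical, the parsing assumes the usual banner formats, and a toolkit slightly newer than the driver may still work within the same major version.

```shell
#!/bin/sh
# Sketch: compare the toolkit version with the highest CUDA version
# the installed driver supports. All function names are hypothetical.

# Reads nvidia-smi output on stdin; prints e.g. "12.2" from "CUDA Version: 12.2".
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Reads `nvcc --version` output on stdin; prints e.g. "11.8" from "release 11.8,".
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p' | head -n 1
}

# version_le A B: succeeds when version A <= version B (relies on GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Live check: report whether the toolkit is newer than the driver supports.
check_cuda_compat() {
    drv=$(nvidia-smi | driver_cuda_version)
    tk=$(nvcc --version | toolkit_cuda_version)
    if version_le "$tk" "$drv"; then
        echo "OK: toolkit $tk <= driver-supported CUDA $drv"
    else
        echo "MISMATCH: toolkit $tk is newer than driver-supported CUDA $drv"
    fi
}
```

Running check_cuda_compat on the server prints a single OK or MISMATCH line.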

3. Troubleshooting Multi-GPU Communication

Objective: Detect the status of inter-GPU communication (NCCL/P2P)

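# Test NCCL communication (add -g <N> to run across N GPUs; the default is 1)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2

# Check P2P access capability (topology matrix)
nvidia-smi topo -m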

Key Metrics:

P2P Status: In the output of nvidia-smi topo -p2p r, every GPU pair should show OK
NCCL Errors: Log contains "transport retry" → Network configuration issues
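
To see the NCCL messages mentioned above, the test has to run with debug logging enabled. The sketch below captures a log with NCCL_DEBUG=INFO and prints only the warning and transport-retry lines. The helper names run_nccl_test and nccl_problem_lines are hypothetical, and the script assumes it is run from the nccl-tests directory after make.

```shell
#!/bin/sh
# Sketch: capture an NCCL debug log and pull out the lines worth reading.
# run_nccl_test and nccl_problem_lines are hypothetical helper names.

# run_nccl_test LOGFILE [extra all_reduce_perf arguments...]
# Run from the nccl-tests directory after `make`.
run_nccl_test() {
    log=$1
    shift
    NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 256M -f 2 "$@" > "$log" 2>&1
}

# nccl_problem_lines LOGFILE: print NCCL warnings and transport-retry messages.
# Returns non-zero when the log contains neither.
nccl_problem_lines() {
    grep -iE 'NCCL WARN|transport retry' "$1"
}
```

For example, run_nccl_test nccl.log -g 2 followed by nccl_problem_lines nccl.log runs the benchmark on two GPUs and lists any suspicious lines.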

4. Handling Resource Allocation Exceptions

Objective: Resolve VRAM/process conflicts

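# List the processes using GPU memory
nvidia-smi --query-compute-apps=pid,used_memory --format=csv

# Force-release GPU memory by killing every compute process (use with caution)
# xargs -r skips the kill when no process is listed
nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -r sudo kill -9

# Restrict GPU visibility (isolate a single card for debugging)
CUDA_VISIBLE_DEVICES=0 python test_script.py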

Typical scenarios:

CUDA_ERROR_OUT_OF_MEMORY → Zombie processes consuming GPU memory
Uneven load across multiple cards → Check task allocation logic
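
Killing every compute process is rarely necessary. As a gentler alternative, the sketch below selects only the processes above a memory threshold so they can be inspected before anything is terminated. The helper name pids_over_mib is hypothetical, and the input is assumed to come from nvidia-smi with the csv,noheader,nounits format, where used_memory is reported in MiB.

```shell
#!/bin/sh
# Sketch: pick out only the heavy GPU processes instead of killing everything.
# pids_over_mib is a hypothetical helper name.

# Reads "pid, used_memory" rows (MiB, no units) on stdin and prints
# the PIDs that use at least $1 MiB of GPU memory.
pids_over_mib() {
    awk -F', *' -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

# Example (commented out so that sourcing this file has no side effects):
# pids=$(nvidia-smi --query-compute-apps=pid,used_memory \
#            --format=csv,noheader,nounits | pids_over_mib 1000 | paste -sd, -)
# [ -n "$pids" ] && ps -o pid,user,etime,cmd -p "$pids"
```

The commented example shows the owner, run time, and command line of each process using 1000 MiB or more, which helps to tell a leftover job from a running one.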

5. In-Depth Hardware Diagnostics

Objective: Identify physical hardware failures

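# Bandwidth test as a basic stress check (requires cuda-samples)
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest

# Continuous monitoring tools
nvidia-smi dmon  # Streams per-GPU metrics, refreshed periodically
nvtop            # Interactive visual monitor (must be installed separately)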

Hardware failure characteristics:

Bandwidth test failure → PCIe slot or cable failure
Continuously increasing ECC error count → GPU memory corruption
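
Whether an ECC counter is "continuously increasing" can be checked by comparing two snapshots taken some time apart. The following is a minimal sketch under stated assumptions: the helper names ecc_snapshot and ecc_increased are hypothetical, and it tracks only the volatile uncorrected counter, so other counters would need their own query fields.

```shell
#!/bin/sh
# Sketch: detect GPUs whose ECC error counter grew between two snapshots.
# ecc_snapshot and ecc_increased are hypothetical helper names.

# Print "index, uncorrected_volatile_errors" rows for every GPU.
ecc_snapshot() {
    nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
               --format=csv,noheader
}

# ecc_increased BEFORE_FILE AFTER_FILE: print indexes of GPUs whose count grew.
# Rows with non-numeric values such as "[N/A]" (ECC unsupported) are skipped.
ecc_increased() {
    awk -F', *' '
        NR == FNR { before[$1] = $2; next }   # first file: remember old counts
        ($1 in before) && $2 ~ /^[0-9]+$/ && before[$1] ~ /^[0-9]+$/ &&
            $2 + 0 > before[$1] + 0 { print $1 }
    ' "$1" "$2"
}
```

A typical use is ecc_snapshot > before.csv, then a workload or a wait, then ecc_snapshot > after.csv and ecc_increased before.csv after.csv. Any index printed deserves a closer look.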

6. Log Analysis and System-Level Troubleshooting

Objective: Analyze low-level logs

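# View kernel logs (filtered for GPU messages; dmesg may require root on Ubuntu)
sudo dmesg | grep -i nvidia

# NVIDIA driver installer log (present when the .run installer was used)
cat /var/log/nvidia-installer.log

# System service status (applies to A100/H100)
systemctl status nvidia-persistenced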

Log Keywords:

GPU has fallen off the bus → Insufficient power
Failed to initialize NVML → Driver loading failed
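
The keyword-to-cause mapping above can be applied to a log stream automatically. The sketch below is only an illustration: explain_gpu_log is a hypothetical helper name, and it knows just the two keywords listed in this section.

```shell
#!/bin/sh
# Sketch: annotate log lines with the likely causes listed above.
# explain_gpu_log is a hypothetical helper name.

# Reads log text on stdin; prints "<cause>: <line>" for each known keyword.
explain_gpu_log() {
    while IFS= read -r line; do
        case $line in
            *"GPU has fallen off the bus"*)
                echo "power/PCIe: $line" ;;
            *"Failed to initialize NVML"*)
                echo "driver: $line" ;;
        esac
    done
}
```

It can be fed from the kernel log with sudo dmesg | explain_gpu_log, or from the tool output with nvidia-smi 2>&1 | explain_gpu_log.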

Important Notes:
For multi-card servers, ensure that "Above 4G Decoding" is enabled in the BIOS
Use sudo update-pciids to update the PCI hardware database
Recommended: Run dcgmi diag -r 3 to perform a comprehensive diagnostic (DCGM installation required)
