1. Basic Status Check
Objective: Verify whether the GPU is recognized by the system
# List all GPU information (NVIDIA)
nvidia-smi
# List PCI device information (generic)
lspci | grep -i nvidia
# Check that the kernel module is loaded
lsmod | grep nvidia
Symptoms:
No output → Driver not installed
Displays "No devices found" → Hardware connection issue
Abnormal temperature/power readings → Cooling failure
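The checks above can be condensed into one pass that maps each outcome to its symptom. This is a minimal sketch; the helper name `check_gpu_status` is illustrative, not a standard tool.

```shell
#!/bin/sh
# Hypothetical helper: run the basic status checks and report the matching symptom.
check_gpu_status() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        # No nvidia-smi binary at all: the driver package is not installed
        echo "nvidia-smi missing -> driver not installed"
    elif ! nvidia-smi >/dev/null 2>&1; then
        # Binary exists but cannot talk to a GPU: check the hardware connection
        echo "nvidia-smi failed -> inspect 'lspci | grep -i nvidia' and 'lsmod | grep nvidia'"
    else
        echo "GPU visible to driver"
    fi
}

check_gpu_status
```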
2. Troubleshooting Drivers and the CUDA Environment
Objective: Verify driver compatibility
# Check the driver version
cat /proc/driver/nvidia/version
# Verify the CUDA toolkit
nvcc --version
# Test basic CUDA functionality
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
Common Issues:
Version conflicts: use sudo apt list --installed | grep nvidia to check the installed driver version
CUDA errors: run cuda-install-samples-.sh to reinstall the test cases
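A common compatibility rule behind these version conflicts: the CUDA version the driver supports (shown in the `nvidia-smi` header) must be at least the toolkit version reported by `nvcc --version`. A hedged sketch of that comparison, where `version_le` is a hypothetical helper and the two version values are example inputs:

```shell
#!/bin/sh
# True when $1 <= $2 under version ordering (relies on sort -V, GNU coreutils)
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

toolkit=12.1   # example: parsed from `nvcc --version`
driver=12.2    # example: parsed from the `nvidia-smi` header
if version_le "$toolkit" "$driver"; then
    echo "toolkit $toolkit works with driver CUDA $driver"
else
    echo "version conflict: install a toolkit <= $driver or upgrade the driver"
fi
```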
3. Troubleshooting Multi-GPU Communication
Objective: Detect the status of inter-GPU communication (NCCL/P2P)
# Test NCCL communication
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
./build/all_reduce_perf -b 8 -e 256M -f 2
# Check P2P access capability
nvidia-smi topo -m
Key Metrics:
P2P status: the P2P columns in the nvidia-smi topo -m output should read OK
NCCL errors: log contains "transport retry" → Network configuration issue
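To watch for regressions, the all_reduce_perf result can be compared against a known baseline for the topology. This sketch assumes the nccl-tests summary line format ("Avg bus bandwidth : N"); the extractor name `avg_bus_bw` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull the average bus bandwidth number out of a
# captured nccl-tests log so a wrapper can compare it to a baseline.
avg_bus_bw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }'
}

# Example on a captured summary line; on a live box pipe the test output instead:
#   ./build/all_reduce_perf -b 8 -e 256M -f 2 | avg_bus_bw
printf '# Avg bus bandwidth    : 85.31\n' | avg_bus_bw
# prints 85.31
```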
4. Handling Resource Allocation Exceptions
Objective: Resolve VRAM/process conflicts
# List processes occupying the GPU
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Force-release VRAM (use with caution)
sudo kill -9 $(nvidia-smi --query-compute-apps=pid --format=csv,noheader)
# Set GPU visibility (isolate a single card for debugging)
CUDA_VISIBLE_DEVICES=0 python test_script.py
Typical scenarios:
CUDA_ERROR_OUT_OF_MEMORY → Zombie processes consuming GPU memory
Uneven load across multiple cards → Check the task allocation logic
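Instead of force-killing every GPU process, it is safer to first list only the heavy consumers. A sketch, assuming the "pid, used_memory" CSV format from the query above; the helper `heavy_gpu_pids` is hypothetical:

```shell
#!/bin/sh
# Hypothetical filter: print only PIDs holding more than a VRAM threshold (MiB),
# so suspects can be inspected before anything is killed.
heavy_gpu_pids() {
    # $1 = threshold in MiB; expects "pid, used_memory" CSV rows
    awk -F', ' -v t="$1" '{ gsub(/ *MiB/, "", $2); if ($2 + 0 > t) print $1 }'
}

# Live usage would be:
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader | heavy_gpu_pids 1024
printf '1234, 8192 MiB\n5678, 128 MiB\n' | heavy_gpu_pids 1024
# prints 1234
```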
5. Hardware-Level Deep Detection
Objective: Identify physical hardware failures
# Stress test (requires cuda-samples)
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
# Continuous monitoring tools
nvidia-smi dmon  # dynamically refreshed metrics
nvtop            # visual monitoring (requires installation)
Hardware failure characteristics:
Bandwidth test failure → PCIe slot or cable failure
ECC Errors continuously increasing → GPU memory corruption
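Monitoring output can also be reduced to alerts. This sketch assumes per-GPU temperatures arrive one per line (as from `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader`); the filter name `flag_hot_gpus` is hypothetical:

```shell
#!/bin/sh
# Hypothetical filter: flag any GPU whose temperature exceeds a threshold.
flag_hot_gpus() {
    # $1 = threshold in degrees C; GPU index = input line number - 1
    awk -v max="$1" '$1 + 0 > max { print "GPU " NR-1 " overheating: " $1 "C" }'
}

# Live usage would be:
#   nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader | flag_hot_gpus 85
printf '65\n92\n' | flag_hot_gpus 85
# prints: GPU 1 overheating: 92C
```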
6. Log Analysis and System-Level Troubleshooting
Objective: Analyze low-level logs
# View kernel logs (filtered for GPU errors)
dmesg | grep -i nvidia
# NVIDIA driver installer log
cat /var/log/nvidia-installer.log
# System service status (applies to A100/H100)
systemctl status nvidia-persistenced
Log Keywords:
"GPU has fallen off the bus" → Insufficient power supply
"Failed to initialize NVML" → Driver failed to load
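The keyword search can be scripted for triage. The pattern list below is an assumption: it covers the two symptoms above plus NVIDIA's "Xid" kernel-error prefix, and is not exhaustive; `scan_gpu_errors` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical triage filter: keep only kernel-log lines matching known
# fatal GPU error keywords.
scan_gpu_errors() {
    grep -E -i 'fallen off the bus|Failed to initialize NVML|Xid'
}

# Live usage would be: dmesg | scan_gpu_errors
printf 'usb 1-1: new device\nNVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.\n' | scan_gpu_errors
```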
Important Note:
For multi-card servers, ensure that "Above 4G Decoding" is enabled in the BIOS
Use sudo update-pciids to update the PCI hardware ID database
Recommended: run dcgmi diag -r 3 for a comprehensive diagnostic (requires DCGM)