Complete Guide to Deploying GPU Servers
1. System Initialization and Basic Configuration
1.1 System Updates and Basic Tools
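The original commands for this step are missing; a minimal sketch, assuming an Ubuntu/Debian system (the tool selection is illustrative):

```shell
# Update the package index and upgrade installed packages (assumes Ubuntu/Debian)
sudo apt update && sudo apt upgrade -y

# Install commonly needed basic tools
sudo apt install -y build-essential git curl wget vim htop tmux net-tools
```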

1.2 Creating a Deployment User (Recommended)
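A sketch of creating a dedicated user for deployments; the username `deploy` is illustrative:

```shell
# Create a dedicated deployment user (username "deploy" is illustrative)
sudo adduser --gecos "" deploy

# Grant sudo access, and docker access once Docker is installed
sudo usermod -aG sudo deploy
sudo usermod -aG docker deploy
```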

1.3 System Security Configuration
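A minimal hardening sketch using `ufw`, opening SSH plus the service ports this guide uses later (8888 for Jupyter Lab, 6006 for TensorBoard); adapt the rules to your network:

```shell
# Allow SSH and the service ports, then enable the firewall
sudo ufw allow 22/tcp     # SSH
sudo ufw allow 8888/tcp   # Jupyter Lab
sudo ufw allow 6006/tcp   # TensorBoard
sudo ufw enable

# Harden SSH: disable password login only AFTER key-based login is confirmed working
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
```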

2. NVIDIA Driver and CUDA Installation
2.1 Installing NVIDIA Drivers
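The installation commands are missing here; one common approach on Ubuntu is to let the distribution select a recommended driver (the explicit driver branch `535` below is illustrative):

```shell
# Let Ubuntu pick and install a recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific driver branch explicitly (535 is illustrative)
# sudo apt install -y nvidia-driver-535

# Reboot so the new kernel module loads
sudo reboot
```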

2.2 Verifying the Installation
```bash
# Check the driver
nvidia-smi

# Check the CUDA compiler
nvcc --version

# List available GPUs
nvidia-smi -L
```
3. Docker and the NVIDIA Container Toolkit
3.1 Install Docker
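A sketch using Docker's official convenience script; for production you may prefer the repository-based install described in Docker's documentation:

```shell
# Install Docker using the official convenience script
curl -fsSL https://get.docker.com | sudo sh

# Allow the current user to run docker without sudo (takes effect after re-login)
sudo usermod -aG docker "$USER"

# Verify
docker --version
```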

3.2 Install NVIDIA Container Toolkit
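A sketch of the repository-based install as documented by NVIDIA at the time of writing; check NVIDIA's Container Toolkit install guide for the current steps:

```shell
# Add NVIDIA's package repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register the runtime with Docker
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```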

3.3 Verify Docker GPU Support
```bash
# Test GPU availability inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
4. Python Deep Learning Environment
4.1 Installing Miniconda
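A minimal non-interactive install sketch, placing Miniconda under `~/miniconda3`:

```shell
# Download and install Miniconda into ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p "$HOME/miniconda3"

# Initialize conda for bash and reload the shell configuration
"$HOME/miniconda3/bin/conda" init bash
source ~/.bashrc
```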

4.2 Creating a Deep Learning Environment
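A sketch of one possible environment; the environment name `dl` (used again in section 11.3), the Python version, and the CUDA 12.1 wheel index are assumptions — match them to your driver and framework requirements:

```shell
# Create and activate an environment named "dl" (name and versions are illustrative)
conda create -n dl python=3.10 -y
conda activate dl

# PyTorch built against CUDA 12.1 (match the CUDA version to your setup)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Or TensorFlow with bundled CUDA libraries
# pip install "tensorflow[and-cuda]"
```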

4.3 Verifying the Deep Learning Environment
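A quick check that the frameworks can see the GPU, assuming PyTorch (and optionally TensorFlow) is installed in the active environment:

```shell
# Confirm PyTorch detects the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# If TensorFlow is installed
# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```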

5. Deployment of Common Deep Learning Frameworks
5.1 Configuring Jupyter Lab
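A sketch of a basic setup, run inside the deep learning environment; the bind address and port 8888 match the checklist in section 10:

```shell
# Install Jupyter Lab and generate a config file
pip install jupyterlab
jupyter lab --generate-config

# Set a login password interactively
jupyter server password

# Listen on all interfaces on port 8888 (protect this behind a firewall or SSH tunnel)
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
```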

5.2 Creating a System Service
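A sketch of a systemd unit so Jupyter Lab survives reboots; the user `deploy` and the conda environment path are illustrative and must match your setup:

```shell
# Write a systemd unit for Jupyter Lab (user and paths are illustrative)
sudo tee /etc/systemd/system/jupyter.service > /dev/null <<'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/miniconda3/envs/dl/bin/jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Load and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now jupyter
```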

5.3 TensorBoard Configuration
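A minimal sketch; the log directory is illustrative, and port 6006 matches the checklist in section 10:

```shell
# Install and run TensorBoard on port 6006 (log directory is illustrative)
pip install tensorboard
tensorboard --logdir "$HOME/data/logs" --host 0.0.0.0 --port 6006
```

For unattended operation, wrap this command in a systemd unit analogous to the Jupyter service.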

6. Common Tools and Libraries
6.1 Machine Learning Tools
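The package list for this section is missing; a typical (illustrative) selection:

```shell
# Common machine learning libraries (selection is illustrative)
pip install scikit-learn xgboost lightgbm pandas numpy matplotlib seaborn
```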

6.2 Computer Vision
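Likewise, an illustrative selection for computer vision work:

```shell
# Common computer vision libraries (selection is illustrative)
pip install opencv-python pillow albumentations timm
```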

6.3 Natural Language Processing
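An illustrative selection for NLP work:

```shell
# Common NLP libraries (selection is illustrative)
pip install transformers datasets tokenizers sentencepiece nltk
```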

6.4 System Monitoring Tools
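An illustrative selection of monitoring utilities, mixing pip-installed GPU monitors with apt-installed system tools:

```shell
# GPU monitoring utilities
pip install gpustat nvitop

# General system monitoring tools (assumes Ubuntu/Debian)
sudo apt install -y htop iotop nload
```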

7. Data Storage and Backup
7.1 Configuring Data Directories
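A sketch of one conventional layout; the root path defaults to `~/data` here but is purely illustrative (a dedicated mount such as `/data` is common on servers):

```shell
# Create a conventional data directory layout (root path is illustrative)
DATA_ROOT="${DATA_ROOT:-$HOME/data}"
mkdir -p "$DATA_ROOT"/{datasets,models,checkpoints,logs,backups}
ls "$DATA_ROOT"
```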

7.2 Configuring Automatic Backups
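A sketch of a nightly rsync backup driven by cron; all paths and the schedule are illustrative:

```shell
# Write a simple backup script (paths are illustrative)
cat > "$HOME/backup.sh" <<'EOF'
#!/bin/bash
# Mirror checkpoints and logs into the backup directory
rsync -a --delete "$HOME/data/checkpoints/" "$HOME/data/backups/checkpoints/"
rsync -a --delete "$HOME/data/logs/" "$HOME/data/backups/logs/"
EOF
chmod +x "$HOME/backup.sh"

# Run it every night at 02:00
(crontab -l 2>/dev/null; echo "0 2 * * * $HOME/backup.sh") | crontab -
```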

8. Production Deployment
8.1 Docker Compose Environment
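Docker Compose v2 ships as a Docker CLI plugin; a quick sketch to install and verify it on Ubuntu/Debian:

```shell
# Install the Compose plugin (assumes Docker's apt repository is configured)
sudo apt install -y docker-compose-plugin

# Verify
docker compose version
```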

8.2 Sample Docker Compose Configuration
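The sample configuration is missing; a minimal sketch that reserves all GPUs for one service using the Compose v2 device-reservation syntax (the image and service name are illustrative):

```shell
# Write a minimal compose file that reserves all GPUs (image/service name are illustrative)
cat > docker-compose.yml <<'EOF'
services:
  trainer:
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF

# Bring the service up; it should print the GPU table and exit
docker compose up
```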

9. Performance Optimization and Monitoring
9.1 GPU Performance Tuning
```bash
# Enable persistence mode
sudo nvidia-smi -pm 1

# Set application clocks (optional; valid memory,graphics pairs are GPU-specific,
# see: nvidia-smi -q -d SUPPORTED_CLOCKS)
sudo nvidia-smi -ac 5001,1590

# Watch GPU utilization in real time
watch -n 1 nvidia-smi
```
9.2 System Monitoring Script
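The script itself is missing; a minimal sketch that appends a CSV snapshot of GPU utilization, memory, and temperature once a minute (the script and log paths are illustrative):

```shell
# Write a simple GPU monitoring loop (paths are illustrative)
cat > "$HOME/gpu_monitor.sh" <<'EOF'
#!/bin/bash
LOG="$HOME/gpu_usage.csv"
while true; do
    # One CSV row per GPU per minute
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total,temperature.gpu \
               --format=csv,noheader >> "$LOG"
    sleep 60
done
EOF
chmod +x "$HOME/gpu_monitor.sh"
```

Run it in the background (e.g. under tmux or a systemd unit) and point your analysis tools at the CSV log.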

10. Deployment Checklist
10.1 Basic Checks
- [ ] System is updated to the latest version
- [ ] NVIDIA drivers are correctly installed (`nvidia-smi` displays normally)
- [ ] CUDA Toolkit is installed (`nvcc --version` returns the correct version)
- [ ] Docker and NVIDIA Container Toolkit are installed
- [ ] Python environment successfully set up
- [ ] PyTorch/TensorFlow GPU versions successfully installed
10.2 Service Check
- [ ] Jupyter Lab service is running normally (port 8888)
- [ ] TensorBoard service is running normally (port 6006)
- [ ] Firewall configured correctly
- [ ] SSH key-based login configured
- [ ] Automatic backup script configuration complete
10.3 Performance Check
- [ ] GPU persistence mode is enabled
- [ ] Monitoring script is running normally
- [ ] Data directory structure is intact
- [ ] Backup policy is active
11. Quick Reference for Common Commands
11.1 GPU-Related
```bash
# Check GPU status
nvidia-smi
watch -n 1 nvidia-smi  # real-time monitoring

# Monitor per-process GPU usage
nvidia-smi pmon -i 0 -s um

# Reset a GPU (requires that no processes are currently using it)
sudo nvidia-smi --gpu-reset -i 0
```
11.2 Docker-related
```bash
# Run a GPU container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Build a GPU image from the Dockerfile in the current directory
docker build --tag my-gpu-app .

# View container CPU/memory usage (GPU usage is visible via nvidia-smi on the host)
docker stats
```
11.3 Environment Management
```bash
# Activate the environment
conda activate dl

# List installed packages
conda list

# Export the environment
conda env export > environment.yml

# Create an environment from a file
conda env create -f environment.yml
```
12. Troubleshooting
12.1 Common Issues
1. NVIDIA Driver Issues
- Symptom: nvidia-smi reports an error
- Solution: Reinstall the driver and verify that the kernel version matches
2. CUDA Version Mismatch
- Symptom: PyTorch/TensorFlow cannot detect the GPU
- Solution: Ensure the CUDA version matches the framework requirements
3. Docker GPU permission issues
- Symptom: Error occurs when running `docker run --gpus all`
- Solution: Check if the user is in the docker group; restart the Docker service
4. Insufficient memory
- Symptom: "CUDA OOM" error during training
- Solution: Reduce the batch size and use gradient accumulation
12.2 Getting Help
```bash
# View service logs
journalctl -u jupyter
journalctl -u tensorboard

# View Docker container logs
docker logs container_name

# View GPU details
nvidia-smi -q -d MEMORY,UTILIZATION,PIDS,TEMPERATURE
```
Once deployment is complete, you will have a fully functional GPU server that supports deep learning training, model deployment, and experiment management. Update drivers and framework versions regularly to keep the system secure and performing at its best.