Complete Guide to Deploying GPU Servers
1. System Initialization and Basic Configuration
1.1 System Updates and Basic Tools
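The original commands for this step are missing; a minimal sketch, assuming an Ubuntu/Debian system (the tool selection is illustrative):

```shell
# Update the package index and upgrade installed packages (assumes Ubuntu/Debian)
sudo apt update && sudo apt upgrade -y

# Install commonly needed basic tools
sudo apt install -y build-essential git curl wget vim htop tmux net-tools
```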

1.2 Creating a Deployment User (Recommended)
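A sketch of creating a dedicated user for deployments; the username `deploy` is illustrative:

```shell
# Create a dedicated deployment user (username "deploy" is illustrative)
sudo adduser --gecos "" deploy

# Grant sudo access, and docker access once Docker is installed
sudo usermod -aG sudo deploy
sudo usermod -aG docker deploy
```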

1.3 System Security Configuration
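A minimal hardening sketch using `ufw`, opening SSH plus the service ports this guide uses later (8888 for Jupyter Lab, 6006 for TensorBoard); adapt the rules to your network:

```shell
# Allow SSH and the service ports, then enable the firewall
sudo ufw allow 22/tcp     # SSH
sudo ufw allow 8888/tcp   # Jupyter Lab
sudo ufw allow 6006/tcp   # TensorBoard
sudo ufw enable

# Harden SSH: disable password login only AFTER key-based login is confirmed working
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
```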

2. NVIDIA Driver and CUDA Installation
2.1 Installing NVIDIA Drivers
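The installation commands are missing here; one common approach on Ubuntu is to let the distribution select a recommended driver (the explicit driver branch `535` below is illustrative):

```shell
# Let Ubuntu pick and install a recommended driver
sudo ubuntu-drivers autoinstall

# Or install a specific driver branch explicitly (535 is illustrative)
# sudo apt install -y nvidia-driver-535

# Reboot so the new kernel module loads
sudo reboot
```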

2.2 Verifying the Installation
```bash
# Check the driver
nvidia-smi

# Check the CUDA compiler
nvcc --version

# List available GPUs
nvidia-smi -L
```
3. Docker and the NVIDIA Container Toolkit
3.1 Install Docker
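A sketch using Docker's official convenience script; for production you may prefer the repository-based install described in Docker's documentation:

```shell
# Install Docker using the official convenience script
curl -fsSL https://get.docker.com | sudo sh

# Allow the current user to run docker without sudo (takes effect after re-login)
sudo usermod -aG docker "$USER"

# Verify
docker --version
```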

3.2 Install NVIDIA Container Toolkit
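A sketch of the repository-based install as documented by NVIDIA at the time of writing; check NVIDIA's Container Toolkit install guide for the current steps:

```shell
# Add NVIDIA's package repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register the runtime with Docker
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```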

3.3 Verify Docker GPU Support
```bash
# Test GPU availability inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
4. Python Deep Learning Environment
4.1 Installing Miniconda
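A minimal non-interactive install sketch, placing Miniconda under `~/miniconda3`:

```shell
# Download and install Miniconda into ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p "$HOME/miniconda3"

# Initialize conda for bash and reload the shell configuration
"$HOME/miniconda3/bin/conda" init bash
source ~/.bashrc
```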

4.2 Creating a Deep Learning Environment
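A sketch of one possible environment; the environment name `dl` (used again in section 11.3), the Python version, and the CUDA 12.1 wheel index are assumptions — match them to your driver and framework requirements:

```shell
# Create and activate an environment named "dl" (name and versions are illustrative)
conda create -n dl python=3.10 -y
conda activate dl

# PyTorch built against CUDA 12.1 (match the CUDA version to your setup)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Or TensorFlow with bundled CUDA libraries
# pip install "tensorflow[and-cuda]"
```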

4.3 Verifying the Deep Learning Environment
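A quick check that the frameworks can see the GPU, assuming PyTorch (and optionally TensorFlow) is installed in the active environment:

```shell
# Confirm PyTorch detects the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# If TensorFlow is installed
# python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```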

5. Deployment of Common Deep Learning Frameworks
5.1 Configuring Jupyter Lab
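A sketch of a basic setup, run inside the deep learning environment; the bind address and port 8888 match the checklist in section 10:

```shell
# Install Jupyter Lab and generate a config file
pip install jupyterlab
jupyter lab --generate-config

# Set a login password interactively
jupyter server password

# Listen on all interfaces on port 8888 (protect this behind a firewall or SSH tunnel)
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
```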

5.2 Creating a System Service
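A sketch of a systemd unit so Jupyter Lab survives reboots; the user `deploy` and the conda environment path are illustrative and must match your setup:

```shell
# Write a systemd unit for Jupyter Lab (user and paths are illustrative)
sudo tee /etc/systemd/system/jupyter.service > /dev/null <<'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
User=deploy
WorkingDirectory=/home/deploy
ExecStart=/home/deploy/miniconda3/envs/dl/bin/jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Load and start the service
sudo systemctl daemon-reload
sudo systemctl enable --now jupyter
```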

5.3 TensorBoard Configuration
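A minimal sketch; the log directory is illustrative, and port 6006 matches the checklist in section 10:

```shell
# Install and run TensorBoard on port 6006 (log directory is illustrative)
pip install tensorboard
tensorboard --logdir "$HOME/data/logs" --host 0.0.0.0 --port 6006
```

For unattended operation, wrap this command in a systemd unit analogous to the Jupyter service.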

6. Common Tools and Libraries
6.1 Machine Learning Tools
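The package list for this section is missing; a typical (illustrative) selection:

```shell
# Common machine learning libraries (selection is illustrative)
pip install scikit-learn xgboost lightgbm pandas numpy matplotlib seaborn
```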

6.2 Computer Vision
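Likewise, an illustrative selection for computer vision work:

```shell
# Common computer vision libraries (selection is illustrative)
pip install opencv-python pillow albumentations timm
```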

6.3 Natural Language Processing
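An illustrative selection for NLP work:

```shell
# Common NLP libraries (selection is illustrative)
pip install transformers datasets tokenizers sentencepiece nltk
```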

6.4 System Monitoring Tools
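An illustrative selection of monitoring utilities, mixing pip-installed GPU monitors with apt-installed system tools:

```shell
# GPU monitoring utilities
pip install gpustat nvitop

# General system monitoring tools (assumes Ubuntu/Debian)
sudo apt install -y htop iotop nload
```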

7. Data Storage and Backup
7.1 Configuring Data Directories
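A sketch of one conventional layout; the root path defaults to `~/data` here but is purely illustrative (a dedicated mount such as `/data` is common on servers):

```shell
# Create a conventional data directory layout (root path is illustrative)
DATA_ROOT="${DATA_ROOT:-$HOME/data}"
mkdir -p "$DATA_ROOT"/{datasets,models,checkpoints,logs,backups}
ls "$DATA_ROOT"
```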

7.2 Configuring Automatic Backups
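A sketch of a nightly rsync backup driven by cron; all paths and the schedule are illustrative:

```shell
# Write a simple backup script (paths are illustrative)
cat > "$HOME/backup.sh" <<'EOF'
#!/bin/bash
# Mirror checkpoints and logs into the backup directory
rsync -a --delete "$HOME/data/checkpoints/" "$HOME/data/backups/checkpoints/"
rsync -a --delete "$HOME/data/logs/" "$HOME/data/backups/logs/"
EOF
chmod +x "$HOME/backup.sh"

# Run it every night at 02:00
(crontab -l 2>/dev/null; echo "0 2 * * * $HOME/backup.sh") | crontab -
```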

8. Production Deployment
8.1 Docker Compose Environment
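Docker Compose v2 ships as a Docker CLI plugin; a quick sketch to install and verify it on Ubuntu/Debian:

```shell
# Install the Compose plugin (assumes Docker's apt repository is configured)
sudo apt install -y docker-compose-plugin

# Verify
docker compose version
```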

8.2 Sample Docker Compose Configuration
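The sample configuration is missing; a minimal sketch that reserves all GPUs for one service using the Compose v2 device-reservation syntax (the image and service name are illustrative):

```shell
# Write a minimal compose file that reserves all GPUs (image/service name are illustrative)
cat > docker-compose.yml <<'EOF'
services:
  trainer:
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF

# Bring the service up; it should print the GPU table and exit
docker compose up
```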

9. Performance Optimization and Monitoring
9.1 GPU Performance Tuning
```bash
# Enable persistence mode
sudo nvidia-smi -pm 1

# Set application clocks (optional; valid memory,graphics pairs are GPU-specific,
# see: nvidia-smi -q -d SUPPORTED_CLOCKS)
sudo nvidia-smi -ac 5001,1590

# Watch GPU utilization in real time
watch -n 1 nvidia-smi
```
9.2 System Monitoring Script
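The script itself is missing; a minimal sketch that appends a CSV snapshot of GPU utilization, memory, and temperature once a minute (the script and log paths are illustrative):

```shell
# Write a simple GPU monitoring loop (paths are illustrative)
cat > "$HOME/gpu_monitor.sh" <<'EOF'
#!/bin/bash
LOG="$HOME/gpu_usage.csv"
while true; do
    # One CSV row per GPU per minute
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total,temperature.gpu \
               --format=csv,noheader >> "$LOG"
    sleep 60
done
EOF
chmod +x "$HOME/gpu_monitor.sh"
```

Run it in the background (e.g. under tmux or a systemd unit) and point your analysis tools at the CSV log.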

10. Deployment Checklist
10.1 Basic Checks
- [ ] System is updated to the latest version
- [ ] NVIDIA drivers are correctly installed (`nvidia-smi` displays normally)
- [ ] CUDA Toolkit is installed (`nvcc --version` returns the correct version)
- [ ] Docker and NVIDIA Container Toolkit are installed
- [ ] Python environment successfully set up
- [ ] PyTorch/TensorFlow GPU versions successfully installed
10.2 Service Check
- [ ] Jupyter Lab service is running normally (port 8888)
- [ ] TensorBoard service is running normally (port 6006)
- [ ] Firewall configured correctly
- [ ] SSH key-based login configured
- [ ] Automatic backup script configuration complete
10.3 Performance Check
- [ ] GPU persistence mode is enabled
- [ ] Monitoring script is running normally
- [ ] Data directory structure is intact
- [ ] Backup policy is active
11. Quick Reference for Common Commands
11.1 GPU-Related
```bash
# Check GPU status
nvidia-smi
watch -n 1 nvidia-smi  # real-time monitoring

# Monitor per-process GPU usage
nvidia-smi pmon -i 0 -s um

# Reset a GPU (requires that no processes are currently using it)
sudo nvidia-smi --gpu-reset -i 0
```
11.2 Docker-related
```bash
# Run a GPU container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Build a GPU image from the Dockerfile in the current directory
docker build --tag my-gpu-app .

# View container CPU/memory usage (GPU usage is visible via nvidia-smi on the host)
docker stats
```
11.3 Environment Management
```bash
# Activate the environment
conda activate dl

# List installed packages
conda list

# Export the environment
conda env export > environment.yml

# Create an environment from a file
conda env create -f environment.yml
```
12. Troubleshooting
12.1 Common Issues
1. NVIDIA Driver Issues
- Symptom: nvidia-smi reports an error
- Solution: Reinstall the driver and verify that the kernel version matches
2. CUDA Version Mismatch
- Symptom: PyTorch/TensorFlow cannot detect the GPU
- Solution: Ensure the CUDA version matches the framework requirements
3. Docker GPU permission issues
- Symptom: Error occurs when running `docker run --gpus all`
- Solution: Check if the user is in the docker group; restart the Docker service
4. Insufficient memory
- Symptom: "CUDA OOM" error during training
- Solution: Reduce the batch size and use gradient accumulation
12.2 Getting Help
```bash
# View service logs
journalctl -u jupyter
journalctl -u tensorboard

# View Docker container logs
docker logs container_name

# View GPU details
nvidia-smi -q -d MEMORY,UTILIZATION,PIDS,TEMPERATURE
```
Once deployment is complete, you will have a fully functional GPU server that supports deep learning training, model deployment, and experiment management. Update drivers and framework versions regularly to keep the system secure and performing at its best.