GPU environment deployment process

Published November 4, 2025


Complete Guide to Deploying GPU Servers


1. System Initialization and Basic Configuration


1.1 System Updates and Basic Tools

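The commands for this step were shown in a screenshot that did not survive extraction; a minimal sketch, assuming Ubuntu/Debian (the tool list is an example):

```bash
# Update the package index and upgrade installed packages
sudo apt update && sudo apt upgrade -y

# Basic build and administration tools
sudo apt install -y build-essential git curl wget vim htop tmux
```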


1.2 Creating a Deployment User (Recommended)

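A sketch of this step; the user name `deploy` is an example used throughout this guide:

```bash
# Create a dedicated deployment user and grant it sudo
sudo adduser deploy
sudo usermod -aG sudo deploy

# Optionally copy SSH keys so the new user can log in with key-based auth
sudo rsync --archive --chown=deploy:deploy ~/.ssh /home/deploy
```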


1.3 System Security Configuration

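A minimal sketch using `ufw` and SSH hardening; the open ports match the services configured later in this guide:

```bash
# Open only the ports used later in this guide, then enable the firewall
sudo ufw allow 22/tcp     # SSH
sudo ufw allow 8888/tcp   # Jupyter Lab
sudo ufw allow 6006/tcp   # TensorBoard
sudo ufw enable

# Disable SSH password login once key-based login is confirmed working
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh
```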


2. NVIDIA Driver and CUDA Installation


2.1 Installing NVIDIA Drivers

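A sketch for Ubuntu; the driver version number is an example, so install whichever version `ubuntu-drivers` recommends for your card:

```bash
# List detected GPUs and the recommended driver
ubuntu-drivers devices

# Install a driver package and reboot (535 is an example version)
sudo apt install -y nvidia-driver-535
sudo reboot
```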

2.2 Verifying the Installation

```bash
# Check the driver
nvidia-smi

# Check CUDA
nvcc --version

# List available GPUs
nvidia-smi -L
```


3. Docker and the NVIDIA Container Toolkit


3.1 Install Docker

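A minimal sketch using Docker's official convenience script:

```bash
# Install Docker via the official convenience script
curl -fsSL https://get.docker.com | sudo sh

# Let the deployment user run docker without sudo (takes effect on next login)
sudo usermod -aG docker $USER
```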


3.2 Install NVIDIA Container Toolkit

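A sketch following NVIDIA's documented apt-repository installation for Debian/Ubuntu:

```bash
# Add NVIDIA's apt repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register the NVIDIA runtime with Docker
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```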


3.3 Verify Docker GPU Support

```bash
# Test GPU availability inside a container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```


4. Python Deep Learning Environment


4.1 Installing Miniconda

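A minimal sketch of a non-interactive install into `~/miniconda3`:

```bash
# Download and install Miniconda, then initialize the shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
"$HOME/miniconda3/bin/conda" init bash
source ~/.bashrc
```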


4.2 Creating a Deep Learning Environment

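A sketch that creates the `dl` environment used elsewhere in this guide; the Python version and PyTorch install command are examples, so check pytorch.org for the current one:

```bash
# Create and activate an environment named "dl"
conda create -n dl python=3.10 -y
conda activate dl

# PyTorch built against CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```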


4.3 Verifying the Deep Learning Environment

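A minimal check, assuming PyTorch was installed in the previous step:

```bash
# Confirm the framework can see the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch; print(torch.cuda.get_device_name(0))"
```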


5. Deployment of Common Deep Learning Frameworks


5.1 Configuring Jupyter Lab

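A minimal sketch; port 8888 matches the checklist later in this guide:

```bash
pip install jupyterlab

# Set a login password, then listen on all interfaces on port 8888
jupyter lab password
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
```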


5.2 Creating a System Service

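A sketch of a systemd unit for Jupyter Lab; the user name and environment path are assumptions matching the examples in this guide:

```bash
# Create a systemd unit so Jupyter survives reboots
sudo tee /etc/systemd/system/jupyter.service > /dev/null <<'EOF'
[Unit]
Description=Jupyter Lab
After=network.target

[Service]
User=deploy
ExecStart=/home/deploy/miniconda3/envs/dl/bin/jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now jupyter
```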


5.3 TensorBoard Configuration

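A minimal sketch; the log directory is an example path (it matches the data layout in section 7):

```bash
pip install tensorboard

# Serve training logs on port 6006
tensorboard --logdir /data/logs --host 0.0.0.0 --port 6006
```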

6. Common Tools and Libraries


6.1 Machine Learning Tools

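A sketch; the package list is an example, not exhaustive:

```bash
# Common machine learning stack
pip install numpy pandas scikit-learn matplotlib seaborn
```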



6.2 Computer Vision

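A sketch; the package list is an example:

```bash
# Common computer vision libraries
pip install opencv-python pillow albumentations
```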


6.3 Natural Language Processing

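A sketch; the package list is an example:

```bash
# Common NLP libraries
pip install transformers datasets tokenizers sentencepiece
```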


6.4 System Monitoring Tools

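A sketch; the tool selection is an example:

```bash
# GPU and system monitoring utilities
pip install gpustat nvitop
sudo apt install -y htop iotop
```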


7. Data Storage and Backup


7.1 Configuring Data Directories

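A sketch of a directory layout; the `/data` paths are assumptions reused in the other examples in this guide:

```bash
# Create a shared data layout and hand it to the deployment user
sudo mkdir -p /data/{datasets,models,logs,backups}
sudo chown -R deploy:deploy /data
```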

7.2 Configuring Automatic Backups

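A minimal sketch using cron; the schedule and paths are examples (note the `\%` escaping that cron requires):

```bash
# Nightly 02:00 backup of the models directory
( crontab -l 2>/dev/null; \
  echo '0 2 * * * tar -czf /data/backups/models-$(date +\%F).tar.gz /data/models' ) | crontab -
```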

8. Production Deployment


8.1 Docker Compose Environment

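A minimal sketch, assuming Docker's apt repository from section 3 is configured:

```bash
# Install the Docker Compose plugin and confirm it works
sudo apt install -y docker-compose-plugin
docker compose version
```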


8.2 Sample Docker Compose Configuration

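A sketch of a `docker-compose.yml` using Compose's documented GPU device reservation; the service name, image, and volume mount are examples:

```yaml
services:
  train:
    image: nvidia/cuda:12.1.0-base-ubuntu22.04
    command: nvidia-smi
    volumes:
      - /data:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```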

9. Performance Optimization and Monitoring


9.1 GPU Performance Tuning

```bash
# Enable persistence mode
sudo nvidia-smi -pm 1

# Set application clocks (optional; the memory,graphics values are GPU-specific)
sudo nvidia-smi -ac 5001,1590

# Watch GPU utilization in real time
watch -n 1 nvidia-smi
```


9.2 System Monitoring Script

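A sketch of a simple monitoring loop; the log path and interval are examples:

```bash
#!/usr/bin/env bash
# gpu_monitor.sh -- append a GPU utilization snapshot to a log every 60 seconds
LOG=/data/logs/gpu_monitor.log
while true; do
  {
    date '+%F %T'
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu \
               --format=csv,noheader
  } >> "$LOG"
  sleep 60
done
```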




10. Deployment Checklist


10.1 Basic Checks

- [ ] System is updated to the latest version

- [ ] NVIDIA drivers are correctly installed (nvidia-smi displays normally)

- [ ] CUDA Toolkit is installed (nvcc --version returns the correct version)

- [ ] Docker and NVIDIA Container Toolkit are installed

- [ ] Python environment successfully set up

- [ ] PyTorch/TensorFlow GPU versions successfully installed


10.2 Service Check

- [ ] Jupyter Lab service running normally (port 8888)

- [ ] TensorBoard service is running normally (port 6006)

- [ ] Firewall configured correctly

- [ ] SSH key-based login configured

- [ ] Automatic backup script configuration complete


10.3 Performance Check

- [ ] GPU persistence mode is enabled

- [ ] Monitoring script is running normally

- [ ] Data directory structure is intact

- [ ] Backup policy is active


11. Quick Reference for Common Commands


11.1 GPU-Related

```bash
# Check GPU status
nvidia-smi
watch -n 1 nvidia-smi   # real-time monitoring

# View per-process GPU usage
nvidia-smi pmon -i 0 -s um

# Kill a runaway GPU process by PID (find it with nvidia-smi), then reset the GPU if needed
kill -9 <PID>
sudo nvidia-smi --gpu-reset -i 0
```


11.2 Docker-related

```bash
# Run a GPU container
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Build a GPU image from the current directory's Dockerfile
docker build --tag my-gpu-app .

# View container CPU/memory usage (for GPU usage, run nvidia-smi on the host)
docker stats
```


11.3 Environment Management

```bash
# Activate the environment
conda activate dl

# View installed packages
conda list

# Export the environment
conda env export > environment.yml

# Create an environment from a file
conda env create -f environment.yml
```



12. Troubleshooting


12.1 Common Issues

1. NVIDIA Driver Issues

   - Symptom: nvidia-smi reports an error

   - Solution: Reinstall the driver and verify that the kernel version matches


2. CUDA Version Mismatch

   - Symptom: PyTorch/TensorFlow cannot detect the GPU

   - Solution: Ensure the CUDA version matches the framework requirements


3. Docker GPU permission issues

   - Symptom: Error occurs when running `docker run --gpus all`

   - Solution: Check if the user is in the docker group; restart the Docker service


4. Insufficient memory

   - Symptom: "CUDA OOM" error during training

   - Solution: Reduce the batch size and use gradient accumulation


12.2 Getting Help

```bash
# View service logs
journalctl -u jupyter
journalctl -u tensorboard

# View Docker logs
docker logs container_name

# View GPU details
nvidia-smi -q -d MEMORY,UTILIZATION,PIDS,TEMPERATURE
```



**Once deployment is complete, you will have a fully functional GPU server capable of supporting deep learning training, model deployment, and experiment management. We recommend regularly updating drivers and framework versions to maintain system security and optimal performance.**

