Getting Started with Distributed Training

A practical primer on launching your first multi-node training run.

This guide walks through a minimal PyTorch FSDP training run on an 8-node H100 cluster with 8 GPUs per node (64 GPUs in total). We assume you already have SSH access to the cluster and that jobs are scheduled with Slurm.

1. Prepare your code

pip install torch==2.4.0 accelerate

The torchrun launcher ships with the torch package, so it does not need a separate install.

2. Create a launch script

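#!/bin/bash
export NCCL_DEBUG=INFO
# ib0 assumes an InfiniBand interface with that name; check with `ip addr`.
export NCCL_SOCKET_IFNAME=ib0
# Use the first node in the Slurm allocation as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# sbatch runs this script on one node only, so srun starts one torchrun per node.
srun --ntasks-per-node=1 torchrun --nnodes=8 --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py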
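
Inside train.py, each process started by torchrun can find its place in the job from the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets. The following is a minimal sketch of reading them, not this guide's actual train.py. The `RankInfo` class and `read_rank_info` helper are hypothetical names, and the code uses only the standard library so it can be checked without GPUs.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    """Position of one process within the job (hypothetical helper type)."""
    rank: int         # global rank, 0 .. world_size - 1
    local_rank: int   # rank within this node, usually used to pick the GPU
    world_size: int   # total number of processes in the job


def read_rank_info(environ=None):
    """Read the variables torchrun exports to every worker process."""
    env = os.environ if environ is None else environ
    missing = [k for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE") if k not in env]
    if missing:
        raise RuntimeError(f"not launched by torchrun? missing: {missing}")
    info = RankInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
    # A rank outside the world size points at a misconfigured launch.
    if not 0 <= info.rank < info.world_size:
        raise RuntimeError(f"rank {info.rank} outside world of {info.world_size}")
    return info
```

With the launch script above, 8 nodes times 8 processes per node should give a WORLD_SIZE of 64. In a real training script, `local_rank` would typically be passed to `torch.cuda.set_device` before the process group is initialized.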

3. Launch via Slurm

sbatch -N 8 --gpus-per-node=8 launch.sh

4. Monitor

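Watch the NCCL logs (enabled by NCCL_DEBUG=INFO in the launch script) to confirm that all 64 ranks join the communicator. To confirm per-GPU utilization, run nvidia-smi dmon on each compute node, for example over SSH. It reports only the GPUs of the machine it runs on, so running it on the head node does not show the training GPUs.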
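
As a scriptable alternative to watching dmon by eye, per-GPU utilization can also be read from nvidia-smi's CSV query mode. This is a sketch under stated assumptions: it swaps `nvidia-smi dmon` for `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`, the helper names and the 10% idle threshold are hypothetical, and it has to run on the compute node whose GPUs you want to check.

```python
import subprocess

# Query mode prints one line per GPU, e.g. "0, 97" (index, utilization in %).
QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"]


def parse_utilization(csv_text):
    """Turn lines such as '0, 97' into {gpu_index: utilization_percent}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        if not util.isdigit():
            continue  # skip GPUs that do not report a numeric utilization
        result[int(index)] = int(util)
    return result


def find_idle_gpus(utilization, threshold=10):
    """Return indices of GPUs below the (assumed) utilization threshold."""
    return sorted(i for i, u in utilization.items() if u < threshold)


def read_local_utilization():
    """Run nvidia-smi on this machine and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(out.stdout)
```

Calling `find_idle_gpus(read_local_utilization())` on a compute node during training should return an empty list when every local GPU is busy. A non-empty list suggests that some ranks on that node are stalled or did not start.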
