This guide walks through a minimal PyTorch FSDP training run on an 8-node H100 cluster. We assume you already have SSH access.
1. Prepare your code
pip install torch==2.4.0 accelerate
Note that torchrun ships with torch itself, so it does not need to be installed as a separate package.
2. Create a launch script
#!/bin/bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0
srun --ntasks-per-node=1 torchrun --nnodes=8 --nproc_per_node=8 \
--rdzv_backend=c10d --rdzv_endpoint=master:29500 \
train.py
3. Launch via Slurm
sbatch -N 8 --ntasks-per-node=1 --gpus-per-node=8 launch.sh
Note that sbatch executes launch.sh only once, on the first allocated node, so the torchrun invocation inside it must be fanned out to all eight nodes (for example with srun).
4. Monitor
Watch the NCCL logs to confirm that all 64 ranks (8 nodes × 8 GPUs) form a single communicator. Run nvidia-smi dmon on each node (for example via srun or ssh) to confirm per-GPU utilization; it only reports the GPUs on the machine where it runs.
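For reference, here is a minimal sketch of what train.py might contain. The model, data, and hyperparameters are placeholders, not part of the original guide; the torchrun-provided environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are real.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; substitute your own architecture.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(100):
        batch = torch.randn(8, 1024, device="cuda")  # synthetic data
        loss = model(batch).square().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__" and "RANK" in os.environ:
    main()  # only run when launched under torchrun
```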
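Putting steps 2 and 3 together, launch.sh could look like the following. This is a sketch: it assumes the hard-coded master hostname is replaced by the first node of the Slurm allocation, derived with scontrol; the interface name ib0 and port 29500 are carried over from the guide.

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0

# Derive the rendezvous endpoint from the allocation instead of hard-coding it.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun per node; each torchrun then spawns 8 workers.
srun torchrun --nnodes=8 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py
```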
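Checking by eye that every rank appears in the NCCL logs is tedious at 64 ranks, so a small parser helps. The "rank R nranks N" pattern below is an assumption about the NCCL INFO log format, which varies across NCCL versions; adjust the regex to match your logs.

```python
import re

# NCCL INFO lines announcing a communicator typically include "rank R nranks N".
# This exact format is an assumption; it varies across NCCL versions.
RANK_RE = re.compile(r"rank (\d+) nranks (\d+)")

def ranks_seen(log_lines):
    """Return (set of ranks observed, expected world size or None)."""
    ranks, nranks = set(), None
    for line in log_lines:
        m = RANK_RE.search(line)
        if m:
            ranks.add(int(m.group(1)))
            nranks = int(m.group(2))
    return ranks, nranks

def all_ranks_joined(log_lines):
    """True once every rank from 0 to nranks-1 has appeared in the logs."""
    ranks, nranks = ranks_seen(log_lines)
    return nranks is not None and ranks == set(range(nranks))
```

Feed it the aggregated stdout of the job; if it returns False, the missing ranks are those absent from the first element of ranks_seen().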