Getting Started with Distributed Training

A practical primer on launching your first multi-node training run.

This guide walks through a minimal PyTorch FSDP training run on an 8-node H100 cluster (8 GPUs per node, 64 ranks in total). We assume you already have SSH access and that the cluster is managed by Slurm.

1. Prepare your code

pip install torch==2.4.0 accelerate

Note that torchrun ships with the torch package itself; it is not a separate PyPI package.
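The launch script below expects a train.py. As a hedged sketch of what that file might contain, here is a minimal FSDP loop; the toy two-layer model and random input batches are stand-ins for your real model and dataloader:

```python
# train.py -- minimal FSDP sketch. The model, data, and hyperparameters
# here are illustrative placeholders, not a recommended configuration.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()
    # Default FSDP wrapping shards the whole module's parameters across ranks.
    model = FSDP(model, device_id=local_rank)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device="cuda")  # stand-in for a real batch
        loss = model(x).pow(2).mean()             # stand-in for a real loss
        loss.backward()
        opt.step()
        opt.zero_grad()
        if dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In a real run you would also add an auto-wrap policy and mixed precision, but this skeleton is enough to exercise the cluster end to end.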

2. Create a launch script

#!/bin/bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0  # substitute your InfiniBand interface name

# Derive the rendezvous host from the Slurm allocation so every node
# agrees on the same endpoint, instead of hard-coding a hostname.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

torchrun --nnodes=8 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="${head_node}:29500" \
  --rdzv_id="$SLURM_JOB_ID" \
  train.py

3. Launch via Slurm

Slurm executes a submitted batch script on the first allocated node only, so use srun to start one copy of the launch script on each of the 8 nodes:

sbatch -N 8 --ntasks-per-node=1 --gpus-per-node=8 --wrap "srun bash launch.sh"

4. Monitor

Watch the NCCL logs (enabled by NCCL_DEBUG=INFO above) to confirm that all 64 ranks join the communicator. Note that nvidia-smi dmon reports only the GPUs on the node where it runs, so check utilization on every node (for example, srun --ntasks-per-node=1 nvidia-smi dmon inside the allocation) rather than only the head node.
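Before committing to a long run, it can help to smoke-test the communicator with a tiny all_reduce. The sketch below (hypothetical filename sanity_check.py) is launched with the same torchrun command, substituting this file for train.py:

```python
# sanity_check.py -- verify that every rank joins the NCCL communicator.
import os

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every rank contributes 1; after all_reduce the value equals the world size.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
assert t.item() == dist.get_world_size(), "communicator is missing ranks"

if dist.get_rank() == 0:
    print(f"all {dist.get_world_size()} ranks reachable")
dist.destroy_process_group()
```

If this hangs, it usually points at rendezvous or network-interface misconfiguration (for example, a wrong NCCL_SOCKET_IFNAME) rather than a problem in the training code.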