This guide walks through a minimal PyTorch FSDP training run on an 8-node H100 cluster. We assume you already have SSH access.
1. Prepare your code
pip install torch==2.4.0 accelerate
Note that torchrun ships with torch itself, so it does not need to be installed as a separate package.
2. Create a launch script
#!/bin/bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0
srun --ntasks-per-node=1 torchrun --nnodes=8 --nproc_per_node=8 \
--rdzv_backend=c10d --rdzv_endpoint=master:29500 \
train.py
3. Launch via Slurm
sbatch -N 8 --ntasks-per-node=1 --gpus-per-node=8 launch.sh
Note that sbatch executes launch.sh only once, on the first allocated node, so the torchrun invocation inside it must be fanned out to all eight nodes (for example with srun).
4. Monitor
Watch the NCCL logs to confirm that all 64 ranks (8 nodes × 8 GPUs) form a single communicator. Run nvidia-smi dmon on each node (for example via srun or ssh) to confirm per-GPU utilization; it only reports the GPUs on the machine where it runs.
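For reference, here is a minimal sketch of what train.py might contain. The model, data, and hyperparameters are placeholders, not part of the original guide; the torchrun-provided environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are real.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Placeholder model; substitute your own architecture.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(100):
        batch = torch.randn(8, 1024, device="cuda")  # synthetic data
        loss = model(batch).square().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__" and "RANK" in os.environ:
    main()  # only run when launched under torchrun
```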
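Putting steps 2 and 3 together, launch.sh could look like the following. This is a sketch: it assumes the hard-coded master hostname is replaced by the first node of the Slurm allocation, derived with scontrol; the interface name ib0 and port 29500 are carried over from the guide.

```shell
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=ib0

# Derive the rendezvous endpoint from the allocation instead of hard-coding it.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun per node; each torchrun then spawns 8 workers.
srun torchrun --nnodes=8 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py
```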
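Checking by eye that every rank appears in the NCCL logs is tedious at 64 ranks, so a small parser helps. The "rank R nranks N" pattern below is an assumption about the NCCL INFO log format, which varies across NCCL versions; adjust the regex to match your logs.

```python
import re

# NCCL INFO lines announcing a communicator typically include "rank R nranks N".
# This exact format is an assumption; it varies across NCCL versions.
RANK_RE = re.compile(r"rank (\d+) nranks (\d+)")

def ranks_seen(log_lines):
    """Return (set of ranks observed, expected world size or None)."""
    ranks, nranks = set(), None
    for line in log_lines:
        m = RANK_RE.search(line)
        if m:
            ranks.add(int(m.group(1)))
            nranks = int(m.group(2))
    return ranks, nranks

def all_ranks_joined(log_lines):
    """True once every rank from 0 to nranks-1 has appeared in the logs."""
    ranks, nranks = ranks_seen(log_lines)
    return nranks is not None and ranks == set(range(nranks))
```

Feed it the aggregated stdout of the job; if it returns False, the missing ranks are those absent from the first element of ranks_seen().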