Multi-GPU Training

EmbodiChain supports distributed RL training across multiple GPUs using PyTorch DistributedDataParallel (DDP).

One process per GPU: Each GPU runs an independent process via torchrun.
Independent components per rank: Each process creates its own environment, collector, buffer, and policy copy.
Gradient synchronization only: Ranks synchronize gradients during each PPO/GRPO update; rollout data is not all-gathered across ranks.
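
The gradient-only synchronization described above can be illustrated with a small pure-Python simulation. This is a sketch, not EmbodiChain code: the function name all_reduce_mean and the list-based gradients are hypothetical stand-ins for the gradient averaging that DDP performs across processes.

```python
from typing import List


def all_reduce_mean(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """Simulate DDP-style gradient averaging across ranks (hypothetical helper).

    per_rank_grads[r] is the local gradient vector that rank r computed from
    its own rollouts. Every rank receives the same element-wise mean.
    """
    world_size = len(per_rank_grads)
    dim = len(per_rank_grads[0])
    mean = [sum(g[i] for g in per_rank_grads) / world_size for i in range(dim)]
    # Each rank gets its own copy of the averaged gradient.
    return [list(mean) for _ in range(world_size)]


# Two ranks computed different local gradients from their own rollouts.
local_grads = [[0.5, -1.0], [1.5, 3.0]]
synced = all_reduce_mean(local_grads)
print(synced)  # [[1.0, 1.0], [1.0, 1.0]]
```

Because every rank applies the same averaged gradient, the policy copies stay identical even though each rank's rollouts never leave its own process.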

Use torchrun with --nproc_per_node equal to the number of GPUs, and add --distributed:

torchrun --nproc_per_node=2 -m embodichain train-rl --config <config_path> --distributed

Example:

torchrun --nproc_per_node=2 -m embodichain train-rl --config configs/agents/rl/push_cube/train_config.yaml --distributed

No config file changes are needed; device and gpu_id are overridden automatically on each rank.

Use CUDA_VISIBLE_DEVICES to select which GPUs to use. Each process sees only the listed GPUs, renumbered as cuda:0, cuda:1, and so on:

# Use GPU 0 and 1
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 -m embodichain train-rl --config <config_path> --distributed

--nproc_per_node must equal the number of GPUs in CUDA_VISIBLE_DEVICES.
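
The rule above can be checked before launching. The following is a minimal sketch using only the Python standard library; visible_gpu_count and check_nproc are hypothetical helper names, not part of EmbodiChain or torchrun.

```python
import os
from typing import Mapping, Optional


def visible_gpu_count(env: Mapping[str, str]) -> Optional[int]:
    """Return how many GPUs CUDA_VISIBLE_DEVICES lists, or None if it is unset."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


def check_nproc(nproc_per_node: int, env: Mapping[str, str] = os.environ) -> None:
    """Raise ValueError if nproc_per_node differs from the visible GPU count."""
    count = visible_gpu_count(env)
    if count is not None and count != nproc_per_node:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but CUDA_VISIBLE_DEVICES "
            f"lists {count} GPU(s)"
        )


# Passes silently: two processes for two visible GPUs.
check_nproc(2, {"CUDA_VISIBLE_DEVICES": "0,1"})
```

When CUDA_VISIBLE_DEVICES is unset, the sketch skips the check, since the visible GPU count cannot be derived from the environment alone.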

Device: Each rank uses cuda:{local_rank}; the simulation and policy run on the assigned GPU.
Seeds: Each rank uses seed + rank, so ranks collect different rollouts.
Total steps: Scaled by world_size; e.g., 2 GPUs collect twice as many steps per iteration as a single GPU.
Logging: Only rank 0 writes to TensorBoard, WandB, and console.
Episode stats: episode_reward_avg_100 and episode_length_avg_100 are aggregated across all ranks via all_gather for accurate global metrics.
Evaluation: Only rank 0 creates and runs the evaluation environment.
Checkpoints: Only rank 0 saves; the underlying policy state (without DDP wrapper) is stored.
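
The per-rank behavior listed above can be sketched in plain Python. torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process; everything else here (RankSettings, rank_settings, total_steps_per_iteration, global_episode_mean) is a hypothetical illustration, and the weighted mean is only one plausible way to combine gathered per-rank episode statistics, not necessarily how EmbodiChain computes them.

```python
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class RankSettings:
    device: str      # GPU assigned to this process
    seed: int        # base seed offset by the global rank
    is_main: bool    # True only on rank 0 (logging, evaluation, checkpoints)
    world_size: int  # total number of processes


def rank_settings(env: Mapping[str, str], base_seed: int) -> RankSettings:
    """Derive per-rank settings from the variables torchrun exports."""
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return RankSettings(
        device=f"cuda:{local_rank}",
        seed=base_seed + rank,
        is_main=(rank == 0),
        world_size=world_size,
    )


def total_steps_per_iteration(steps_per_rank: int, world_size: int) -> int:
    """Steps collected per iteration across all ranks."""
    return steps_per_rank * world_size


def global_episode_mean(gathered: List[Tuple[float, int]]) -> float:
    """Combine per-rank (reward_sum, episode_count) pairs into one mean.

    Assumption: each rank contributes its sum and count, as if collected
    with an all_gather, and the global mean weights ranks by episode count.
    """
    total = sum(s for s, _ in gathered)
    count = sum(n for _, n in gathered)
    return total / count if count else 0.0


settings = rank_settings({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}, base_seed=42)
print(settings.device, settings.seed, settings.is_main)  # cuda:1 43 False
print(total_steps_per_iteration(1024, settings.world_size))  # 2048
print(global_episode_mean([(50.0, 10), (150.0, 30)]))  # 5.0
```

Weighting by episode count keeps a rank that finished few episodes from skewing the global average, which a plain mean of per-rank averages would not do.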