Multi-GPU Training#

EmbodiChain supports distributed RL training across multiple GPUs using PyTorch DistributedDataParallel (DDP).

Overview#

  • One process per GPU: Each GPU runs an independent process via torchrun.

  • Independent components per rank: Each process creates its own environment, collector, buffer, and policy copy.

  • Gradient synchronization only: DDP averages gradients across all ranks during each PPO/GRPO update; rollouts are not all-gathered.
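
To make the gradient synchronization point concrete, the sketch below simulates in plain Python what a mean all-reduce does: every rank contributes its local gradients and every rank ends up with the same element-wise average. The function name `all_reduce_mean` and the toy gradient values are illustrative assumptions, not part of EmbodiChain or PyTorch; in real training DDP performs this step internally.

```python
from typing import List


def all_reduce_mean(per_rank_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a mean all-reduce over ranks (hypothetical helper).

    Each inner list holds one rank's local gradients. The result gives
    every rank an identical copy of the element-wise average.
    """
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    mean = [
        sum(grads[i] for grads in per_rank_grads) / world_size
        for i in range(num_params)
    ]
    # Every rank receives its own copy of the averaged gradients.
    return [list(mean) for _ in range(world_size)]


# Two ranks computed different gradients from their own rollouts.
local_grads = [[0.25, -0.5], [0.75, 0.0]]
synced = all_reduce_mean(local_grads)
```

Because only gradients are exchanged, each rank's rollout data stays local and the policy copies remain identical after every optimizer step.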

Launch Commands#

Single-Node Multi-GPU#

Use torchrun with --nproc_per_node equal to the number of GPUs, and add --distributed:

torchrun --nproc_per_node=2 -m embodichain.agents.rl.train --config <config_path> --distributed

Example:

torchrun --nproc_per_node=2 -m embodichain.agents.rl.train --config configs/agents/rl/push_cube/train_config.json --distributed

No config file changes are needed; device and gpu_id are overridden automatically for each rank.
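
As a rough illustration of the per-rank override, the sketch below derives device and gpu_id from the LOCAL_RANK environment variable that torchrun sets for each process. The helper `apply_rank_overrides` and the flat config layout are assumptions made for this example; the actual EmbodiChain config may nest these keys differently.

```python
import os
from typing import Any, Dict, Mapping


def apply_rank_overrides(
    config: Dict[str, Any], env: Mapping[str, str] = os.environ
) -> Dict[str, Any]:
    """Return a copy of config with device and gpu_id set for this rank.

    Hypothetical helper: assumes a flat config with "device" and "gpu_id"
    keys. torchrun exports LOCAL_RANK for every process it launches.
    """
    local_rank = int(env.get("LOCAL_RANK", "0"))
    updated = dict(config)  # leave the caller's config untouched
    updated["device"] = f"cuda:{local_rank}"
    updated["gpu_id"] = local_rank
    return updated


# The process launched as local rank 1 gets cuda:1, whatever the file says.
base_config = {"device": "cuda:0", "gpu_id": 0, "seed": 42}
rank1_config = apply_rank_overrides(base_config, {"LOCAL_RANK": "1"})
```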

Specifying GPUs#

Use CUDA_VISIBLE_DEVICES to select which GPUs to use. The processes will see only these GPUs as cuda:0, cuda:1, etc.:

# Use GPU 0 and 1
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 -m embodichain.agents.rl.train --config <config_path> --distributed

--nproc_per_node must equal the number of GPUs in CUDA_VISIBLE_DEVICES.
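
A mismatch between the two values can be caught before launching. The following is a minimal sketch of such a check; `visible_gpu_count` and `check_nproc` are hypothetical helpers written for this example and are not provided by EmbodiChain or torchrun.

```python
import os
from typing import Mapping, Optional


def visible_gpu_count(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Count the entries in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: all GPUs are visible, count unknown here
    return len([dev for dev in value.split(",") if dev.strip()])


def check_nproc(nproc_per_node: int, env: Mapping[str, str] = os.environ) -> None:
    """Raise ValueError if nproc_per_node differs from the visible GPU count."""
    count = visible_gpu_count(env)
    if count is not None and count != nproc_per_node:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but CUDA_VISIBLE_DEVICES "
            f"lists {count} GPU(s)"
        )


# Matches the example above: two visible GPUs, two processes.
check_nproc(2, {"CUDA_VISIBLE_DEVICES": "0,1"})
```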

Behavior#

  • Device: Each rank uses cuda:{local_rank}; the simulation and policy run on the assigned GPU.

  • Seeds: Each rank uses seed + rank, so rollouts differ across ranks.

  • Total steps: Scaled by world_size; e.g., with 2 GPUs, each iteration collects twice as many steps as a single-GPU run.

  • Logging: Only rank 0 writes to TensorBoard, WandB, and console.

  • Episode stats: episode_reward_avg_100 and episode_length_avg_100 are aggregated across all ranks via all_gather for accurate global metrics.

  • Evaluation: Only rank 0 creates and runs the evaluation environment.

  • Checkpoints: Only rank 0 saves; the underlying policy state (without DDP wrapper) is stored.
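
The per-rank rules listed above can be sketched as a few small functions. All names here are hypothetical and chosen for illustration. In particular, the aggregation shown (pool the episode values gathered from every rank, then take one mean) is an assumption about how a global metric could be computed; the source does not specify the exact formula or how the 100-episode window is applied.

```python
from typing import List, Sequence


def rank_seed(base_seed: int, rank: int) -> int:
    """Seed for one rank: seed + rank, so each rank draws different rollouts."""
    return base_seed + rank


def total_steps_per_iteration(steps_per_rank: int, world_size: int) -> int:
    """Steps collected per iteration across all ranks."""
    return steps_per_rank * world_size


def is_main_process(rank: int) -> bool:
    """Only rank 0 logs, evaluates, and saves checkpoints."""
    return rank == 0


def global_episode_average(per_rank_values: Sequence[Sequence[float]]) -> float:
    """Mean over all episodes from all ranks (assumed aggregation).

    Pooling before averaging weights each episode equally, unlike a
    mean of per-rank means when ranks finish different episode counts.
    """
    merged: List[float] = [v for values in per_rank_values for v in values]
    return sum(merged) / len(merged) if merged else 0.0


seeds = [rank_seed(42, rank) for rank in range(2)]
steps = total_steps_per_iteration(steps_per_rank=1024, world_size=2)
# Rank 0 finished two episodes, rank 1 finished one.
reward_avg = global_episode_average([[10.0, 20.0], [30.0]])
```

In this toy case the pooled mean is 20.0, whereas averaging the per-rank means (15.0 and 30.0) would give 22.5, which is why gathering the raw values matters for an accurate global metric.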