Reinforcement Learning Training#

This tutorial shows you how to train reinforcement learning agents using EmbodiChain’s RL framework. You’ll learn how to configure training via JSON, set up environments, policies, and algorithms, and launch training sessions.

Overview#

The RL framework provides a modular, extensible stack for robotics tasks:

  • Trainer: Orchestrates the training loop (calls algorithm for data collection and updates, handles logging/eval/save)

  • Algorithm: Controls data collection process (interacts with environment, fills buffer, computes advantages/returns) and updates the policy (e.g., PPO)

  • Policy: Neural network models implementing a unified interface (get_action/get_value/evaluate_actions)

  • Buffer: On-policy rollout storage and minibatch iterator (managed by algorithm)

  • Env Factory: Build environments from a JSON config via registry

Architecture#

The framework keeps a clean separation of concerns between the components listed above.

The core components and their relationships:

  • Trainer → Policy, Env, Algorithm (via callbacks for statistics)

  • Algorithm → Policy, RolloutBuffer (algorithm manages its own buffer)

Configuration via JSON#

Training is configured via a JSON file that defines runtime settings, environment, policy, and algorithm parameters.

Example Configuration#

The configuration file (e.g., train_config.json) is located in configs/agents/rl/push_cube:

Example: train_config.json
{
  "trainer": {
    "exp_name": "push_cube_ppo",
    "gym_config": "configs/agents/rl/push_cube/gym_config.json",
    "seed": 42,
    "device": "cuda:0",
    "headless": true,
    "iterations": 1000,
    "rollout_steps": 1024,
    "eval_freq": 2,
    "save_freq": 200,
    "use_wandb": false,
    "wandb_project_name": "embodychain-push_cube",
    "events": {
      "eval": {
        "record_camera": {
          "func": "record_camera_data_async",
          "mode": "interval",
          "interval_step": 1,
          "params": {
            "name": "main_cam",
            "resolution": [640, 480],
            "eye": [-1.4, 1.4, 2.0],
            "target": [0, 0, 0],
            "up": [0, 0, 1],
            "intrinsics": [600, 600, 320, 240],
            "save_path": "./outputs/videos/eval"
          }
        }
      }
    }
  },
  "policy": {
    "name": "actor_critic",
    "actor": {
      "type": "mlp",
      "network_cfg": {
        "hidden_sizes": [256, 256],
        "activation": "relu"
      }
    },
    "critic": {
      "type": "mlp",
      "network_cfg": {
        "hidden_sizes": [256, 256],
        "activation": "relu"
      }
    }
  },
  "algorithm": {
    "name": "ppo",
    "cfg": {
      "learning_rate": 0.0001,
      "n_epochs": 10,
      "batch_size": 8192,
      "gamma": 0.99,
      "gae_lambda": 0.95,
      "clip_coef": 0.2,
      "ent_coef": 0.01,
      "vf_coef": 0.5,
      "max_grad_norm": 0.5
    }
  }
}

Configuration Sections#

Runtime Settings#

The trainer section controls runtime and experiment setup; the sketch after this list shows the defaults train.py applies when a key is omitted:

  • exp_name: Experiment name (used for output directories)

  • gym_config: Path to the environment (gym) configuration JSON

  • seed: Random seed for reproducibility

  • device: Torch device string (e.g., “cpu” or “cuda:0”; default: “cpu”)

  • headless: Whether to run simulation in headless mode

  • iterations: Number of training iterations

  • rollout_steps: Steps per rollout (e.g., 1024)

  • eval_freq: Frequency of evaluation (in steps)

  • save_freq: Frequency of checkpoint saving (in steps)

  • use_wandb: Whether to enable Weights & Biases logging

  • wandb_project_name: Weights & Biases project name
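
When a key is missing from the trainer block, train.py falls back to a built-in default. The snippet below mirrors the relevant lines of the script (shown in full later); the partially filled trainer_cfg dict is only an illustration:

# Defaults applied by train.py when a trainer key is omitted.
trainer_cfg = {"exp_name": "push_cube_ppo", "seed": 42}  # illustrative partial block

exp_name = trainer_cfg.get("exp_name", "generic_exp")
seed = int(trainer_cfg.get("seed", 1))
device_str = trainer_cfg.get("device", "cpu")
iterations = int(trainer_cfg.get("iterations", 250))
rollout_steps = int(trainer_cfg.get("rollout_steps", 2048))
eval_freq = int(trainer_cfg.get("eval_freq", 10000))
save_freq = int(trainer_cfg.get("save_freq", 50000))
headless = bool(trainer_cfg.get("headless", True))
use_wandb = trainer_cfg.get("use_wandb", False)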

Environment Configuration#

The env section defines the task environment:

  • id: Environment registry ID (e.g., “PushCubeRL”)

  • cfg: Environment-specific configuration parameters

Example:

"env": {
  "id": "PushCubeRL",
  "cfg": {
    "num_envs": 4,
    "obs_mode": "state",
    "episode_length": 100,
    "action_scale": 0.1,
    "success_threshold": 0.1
  }
}
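
In train.py the environment is ultimately constructed by loading the gym config JSON referenced by trainer.gym_config, converting it to a typed config, and passing it to the registry factory. A condensed sketch of those calls, using the example path from the trainer block above:

# Sketch: turning an environment config into an env instance (mirrors train.py).
from embodichain.lab.gym.envs.tasks.rl import build_env
from embodichain.lab.gym.utils.gym_utils import config_to_rl_cfg
from embodichain.utils.utility import load_json

gym_config_data = load_json("configs/agents/rl/push_cube/gym_config.json")
gym_env_cfg = config_to_rl_cfg(gym_config_data)                # JSON -> typed RL env cfg
env = build_env(gym_env_cfg.env_id, base_env_cfg=gym_env_cfg)  # registry lookup by id (e.g. "PushCubeRL")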

Policy Configuration#

The policy section defines the neural network policy:

  • name: Policy name (e.g., “actor_critic”, “vla”)

  • cfg: Policy-specific hyperparameters (empty for actor_critic)

  • actor: Actor network configuration (required for actor_critic)

  • critic: Critic network configuration (required for actor_critic)

Example:

"policy": {
  "name": "actor_critic",
  "cfg": {},
  "actor": {
    "type": "mlp",
    "hidden_sizes": [256, 256],
    "activation": "relu"
  },
  "critic": {
    "type": "mlp",
    "hidden_sizes": [256, 256],
    "activation": "relu"
  }
}
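
For the actor_critic policy, train.py resolves this block by building the actor and critic MLPs from their sub-configs and handing them to the policy registry. A condensed sketch of those calls (env, device, and policy_block are assumed to already exist, as in the full script below):

# Sketch: building an actor_critic policy from the JSON block (mirrors train.py).
from embodichain.agents.rl.models import build_mlp_from_cfg, build_policy

obs_dim = env.observation_space.shape[-1]
action_dim = env.action_space.shape[-1]

actor = build_mlp_from_cfg(policy_block["actor"], obs_dim, action_dim)
critic = build_mlp_from_cfg(policy_block["critic"], obs_dim, 1)  # critic outputs a scalar value

policy = build_policy(
    policy_block,
    env.observation_space,
    env.action_space,
    device,
    actor=actor,
    critic=critic,
)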

Algorithm Configuration#

The algorithm section defines the RL algorithm:

  • name: Algorithm name (e.g., “ppo”)

  • cfg: Algorithm-specific hyperparameters

Example:

"algorithm": {
  "name": "ppo",
  "cfg": {
    "learning_rate": 0.0001,
    "n_epochs": 10,
    "batch_size": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_coef": 0.2,
    "ent_coef": 0.01,
    "vf_coef": 0.5,
    "max_grad_norm": 0.5
  }
}
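
The algorithm block is passed straight to the factory, as in train.py (policy and device assumed to already exist):

# Sketch: building the algorithm from the JSON block (mirrors train.py).
from embodichain.agents.rl.algo import build_algo

algo_name = algo_block["name"].lower()  # e.g. "ppo"
algo_cfg = algo_block["cfg"]
algo = build_algo(algo_name, algo_cfg, policy, device)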

Training Script#

The training script (train.py) is located in embodichain/agents/rl/:

Code for train.py
# ----------------------------------------------------------------------------
# Copyright (c) 2021-2025 DexForce Technology Co., Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ----------------------------------------------------------------------------

import argparse
import os
import time
from pathlib import Path

import numpy as np
import torch
import wandb
import json
from torch.utils.tensorboard import SummaryWriter
from copy import deepcopy

from embodichain.agents.rl.models import build_policy, get_registered_policy_names
from embodichain.agents.rl.models import build_mlp_from_cfg
from embodichain.agents.rl.algo import build_algo, get_registered_algo_names
from embodichain.agents.rl.utils.trainer import Trainer
from embodichain.utils import logger
from embodichain.lab.gym.envs.tasks.rl import build_env
from embodichain.lab.gym.utils.gym_utils import config_to_rl_cfg
from embodichain.utils.utility import load_json
from embodichain.utils.module_utils import find_function_from_modules
from embodichain.lab.sim import SimulationManagerCfg
from embodichain.lab.gym.envs.managers.cfg import EventCfg


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True, help="Path to JSON config")
    args = parser.parse_args()

    with open(args.config, "r") as f:
        cfg_json = json.load(f)

    trainer_cfg = cfg_json["trainer"]
    policy_block = cfg_json["policy"]
    algo_block = cfg_json["algorithm"]

    # Runtime
    exp_name = trainer_cfg.get("exp_name", "generic_exp")
    seed = int(trainer_cfg.get("seed", 1))
    device_str = trainer_cfg.get("device", "cpu")
    iterations = int(trainer_cfg.get("iterations", 250))
    rollout_steps = int(trainer_cfg.get("rollout_steps", 2048))
    eval_freq = int(trainer_cfg.get("eval_freq", 10000))
    save_freq = int(trainer_cfg.get("save_freq", 50000))
    headless = bool(trainer_cfg.get("headless", True))
    wandb_project_name = trainer_cfg.get("wandb_project_name", "embodychain-generic")

    # Device
    if not isinstance(device_str, str):
        raise ValueError(
            f"runtime.device must be a string such as 'cpu' or 'cuda:0'. Got: {device_str!r}"
        )
    try:
        device = torch.device(device_str)
    except RuntimeError as exc:
        raise ValueError(
            f"Failed to parse runtime.device='{device_str}': {exc}"
        ) from exc

    if device.type == "cuda":
        if not torch.cuda.is_available():
            raise ValueError(
                "CUDA device requested but torch.cuda.is_available() is False."
            )
        index = (
            device.index if device.index is not None else torch.cuda.current_device()
        )
        device_count = torch.cuda.device_count()
        if index < 0 or index >= device_count:
            raise ValueError(
                f"CUDA device index {index} is out of range (available devices: {device_count})."
            )
        torch.cuda.set_device(index)
        device = torch.device(f"cuda:{index}")
    elif device.type != "cpu":
        raise ValueError(f"Unsupported device type: {device}")
    logger.log_info(f"Device: {device}")

    # Seeds
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    if device.type == "cuda":
        torch.cuda.manual_seed_all(seed)

    # Outputs
    run_stamp = time.strftime("%Y%m%d_%H%M%S")
    run_base = os.path.join("outputs", f"{exp_name}_{run_stamp}")
    log_dir = os.path.join(run_base, "logs")
    checkpoint_dir = os.path.join(run_base, "checkpoints")
    os.makedirs(log_dir, exist_ok=True)
    os.makedirs(checkpoint_dir, exist_ok=True)
    writer = SummaryWriter(f"{log_dir}/{exp_name}")

    # Initialize Weights & Biases (optional)
    use_wandb = trainer_cfg.get("use_wandb", False)
    if use_wandb:
        wandb.init(project=wandb_project_name, name=exp_name, config=cfg_json)

    gym_config_path = Path(trainer_cfg["gym_config"])
    logger.log_info(f"Current working directory: {Path.cwd()}")

    gym_config_data = load_json(str(gym_config_path))
    gym_env_cfg = config_to_rl_cfg(gym_config_data)

    # Ensure sim configuration mirrors runtime overrides
    if gym_env_cfg.sim_cfg is None:
        gym_env_cfg.sim_cfg = SimulationManagerCfg()
    if device.type == "cuda":
        gpu_index = device.index
        if gpu_index is None:
            gpu_index = torch.cuda.current_device()
        gym_env_cfg.sim_cfg.sim_device = torch.device(f"cuda:{gpu_index}")
        if hasattr(gym_env_cfg.sim_cfg, "gpu_id"):
            gym_env_cfg.sim_cfg.gpu_id = gpu_index
    else:
        gym_env_cfg.sim_cfg.sim_device = torch.device("cpu")
    gym_env_cfg.sim_cfg.headless = headless

    logger.log_info(
        f"Loaded gym_config from {gym_config_path} (env_id={gym_env_cfg.env_id}, headless={gym_env_cfg.sim_cfg.headless}, sim_device={gym_env_cfg.sim_cfg.sim_device})"
    )

    env = build_env(gym_env_cfg.env_id, base_env_cfg=gym_env_cfg)

    eval_gym_env_cfg = deepcopy(gym_env_cfg)
    eval_gym_env_cfg.num_envs = 4
    eval_gym_env_cfg.sim_cfg.headless = True

    eval_env = build_env(eval_gym_env_cfg.env_id, base_env_cfg=eval_gym_env_cfg)

    # Build Policy via registry (actor/critic must be explicitly defined in JSON when using actor_critic)
    policy_name = policy_block["name"]
    if policy_name.lower() == "actor_critic":
        obs_dim = env.observation_space.shape[-1]
        action_dim = env.action_space.shape[-1]

        actor_cfg = policy_block.get("actor")
        critic_cfg = policy_block.get("critic")
        if actor_cfg is None or critic_cfg is None:
            raise ValueError(
                "ActorCritic requires 'actor' and 'critic' definitions in JSON (policy.actor / policy.critic)."
            )

        actor = build_mlp_from_cfg(actor_cfg, obs_dim, action_dim)
        critic = build_mlp_from_cfg(critic_cfg, obs_dim, 1)

        policy = build_policy(
            policy_block,
            env.observation_space,
            env.action_space,
            device,
            actor=actor,
            critic=critic,
        )
    else:
        policy = build_policy(
            policy_block, env.observation_space, env.action_space, device
        )

    # Build Algorithm via factory
    algo_name = algo_block["name"].lower()
    algo_cfg = algo_block["cfg"]
    algo = build_algo(algo_name, algo_cfg, policy, device)

    # Build Trainer
    event_modules = [
        "embodichain.lab.gym.envs.managers.randomization",
        "embodichain.lab.gym.envs.managers.record",
        "embodichain.lab.gym.envs.managers.events",
    ]
    events_dict = trainer_cfg.get("events", {})
    train_event_cfg = {}
    eval_event_cfg = {}
    # Parse train events
    for event_name, event_info in events_dict.get("train", {}).items():
        event_func_str = event_info.get("func")
        mode = event_info.get("mode", "interval")
        params = event_info.get("params", {})
        interval_step = event_info.get("interval_step", 1)
        event_func = find_function_from_modules(
            event_func_str, event_modules, raise_if_not_found=True
        )
        train_event_cfg[event_name] = EventCfg(
            func=event_func,
            mode=mode,
            params=params,
            interval_step=interval_step,
        )
    # Parse eval events
    for event_name, event_info in events_dict.get("eval", {}).items():
        event_func_str = event_info.get("func")
        mode = event_info.get("mode", "interval")
        params = event_info.get("params", {})
        interval_step = event_info.get("interval_step", 1)
        event_func = find_function_from_modules(
            event_func_str, event_modules, raise_if_not_found=True
        )
        eval_event_cfg[event_name] = EventCfg(
            func=event_func,
            mode=mode,
            params=params,
            interval_step=interval_step,
        )
    trainer = Trainer(
        policy=policy,
        env=env,
        algorithm=algo,
        num_steps=rollout_steps,
        batch_size=algo_cfg["batch_size"],
        writer=writer,
        eval_freq=eval_freq,
        save_freq=save_freq,
        checkpoint_dir=checkpoint_dir,
        exp_name=exp_name,
        use_wandb=use_wandb,
        eval_env=eval_env,
        event_cfg=train_event_cfg,
        eval_event_cfg=eval_event_cfg,
    )

    logger.log_info("Generic training initialized")
    logger.log_info(f"Task: {type(env).__name__}")
    logger.log_info(
        f"Policy: {policy_name} (available: {get_registered_policy_names()})"
    )
    logger.log_info(
        f"Algorithm: {algo_name} (available: {get_registered_algo_names()})"
    )

    total_steps = int(iterations * rollout_steps * env.num_envs)
    logger.log_info(f"Total steps: {total_steps} (iterations≈{iterations})")

    try:
        trainer.train(total_steps)
    except KeyboardInterrupt:
        logger.log_info("Training interrupted by user")
    finally:
        trainer.save_checkpoint()
        writer.close()
        if use_wandb:
            try:
                wandb.finish()
            except Exception:
                pass
        logger.log_info("Training finished")


if __name__ == "__main__":
    main()

The Script Explained#

The training script performs the following steps:

  1. Parse Configuration: Loads JSON config and extracts runtime/env/policy/algorithm blocks

  2. Setup: Initializes device, seeds, output directories, TensorBoard, and Weights & Biases

  3. Build Components: Environment via the build_env() factory, policy via the build_policy() registry, and algorithm via the build_algo() factory

  4. Create Trainer: Instantiates the Trainer with all components

  5. Train: Runs the training loop until completion

Launching Training#

To start training, run:

python embodichain/agents/rl/train.py --config configs/agents/rl/push_cube/train_config.json

Outputs#

All outputs are written to ./outputs/<exp_name>_<timestamp>/:

  • logs/: TensorBoard logs

  • checkpoints/: Model checkpoints
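
To monitor a run while it trains, you can point TensorBoard at the log directory, e.g. tensorboard --logdir outputs/push_cube_ppo_<timestamp>/logs.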

Training Process#

The training process follows this sequence (a schematic sketch follows the list):

  1. Rollout Phase: Algorithm collects trajectories by interacting with the environment (via collect_rollout). During this phase, the trainer performs dense per-step logging of rewards and metrics from environment info.

  2. GAE Computation: Algorithm computes advantages and returns using Generalized Advantage Estimation (internal to algorithm, stored in buffer extras)

  3. Update Phase: Algorithm updates the policy using collected data (e.g., PPO)

  4. Logging: Trainer logs training losses and aggregated metrics to TensorBoard and Weights & Biases

  5. Evaluation (periodic): Trainer evaluates the current policy

  6. Checkpointing (periodic): Trainer saves model checkpoints
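
Schematically, one iteration of this loop corresponds to the sketch below. It is illustrative pseudostructure, not the actual Trainer code; obs stands for the current batched observation, and the periodic steps follow the configured eval_freq / save_freq.

# Illustrative sketch of one training iteration (not the actual Trainer code).
for iteration in range(iterations):
    # 1-2. Rollout + GAE: the algorithm fills its buffer and attaches
    #      advantages/returns via the buffer extras.
    algo.collect_rollout(env, policy, obs, num_steps=rollout_steps)

    # 3. Update phase: e.g. PPO epochs over minibatches drawn from the buffer.
    update_stats = algo.update()

    # 4. Logging: the trainer writes losses and aggregated metrics to
    #    TensorBoard (and Weights & Biases if enabled).
    # 5-6. At the configured eval_freq / save_freq, the trainer evaluates the
    #      policy on eval_env and saves a checkpoint.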

Policy Interface#

All policies must inherit from the Policy abstract base class:

from abc import ABC, abstractmethod
from typing import Tuple

import torch
import torch.nn as nn


class Policy(nn.Module, ABC):
    device: torch.device

    @abstractmethod
    def get_action(
        self, obs: torch.Tensor, deterministic: bool = False
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (action, log_prob, value)."""
        raise NotImplementedError

    @abstractmethod
    def get_value(self, obs: torch.Tensor) -> torch.Tensor:
        """Returns value estimate."""
        raise NotImplementedError

    @abstractmethod
    def evaluate_actions(
        self, obs: torch.Tensor, actions: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Returns (log_prob, entropy, value)."""
        raise NotImplementedError
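
In a typical on-policy flow, get_action and get_value are used while collecting rollouts, and evaluate_actions is called during the update to recompute log-probabilities, entropies, and values for the stored observation/action batches.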

Available Policies#

  • ActorCritic: MLP-based Gaussian policy with learnable log_std. Requires external actor and critic modules to be provided (defined in JSON config).

  • VLAPlaceholderPolicy: Placeholder for Vision-Language-Action policies

Algorithms#

Available Algorithms#

  • PPO: Proximal Policy Optimization with GAE
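
For reference, PPO's update minimizes a clipped surrogate objective combined with a value loss and an entropy bonus. The sketch below is the textbook form of that per-minibatch loss, written with the coefficient names from the algorithm config; it is not an excerpt of the framework's implementation.

import torch


def ppo_loss(log_prob, old_log_prob, advantages, returns, values, entropy,
             clip_coef=0.2, ent_coef=0.01, vf_coef=0.5):
    """Textbook PPO loss, using the coefficient names from the algorithm config."""
    ratio = torch.exp(log_prob - old_log_prob)                   # pi_new / pi_old per sample
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()          # clipped surrogate objective
    value_loss = ((values - returns) ** 2).mean()                # critic regression to returns
    entropy_bonus = entropy.mean()                               # encourages exploration
    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus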

Adding a New Algorithm#

To add a new algorithm:

  1. Create a new algorithm class in embodichain/agents/rl/algo/

  2. Implement initialize_buffer(), collect_rollout(), and update() methods

  3. Register in algo/__init__.py:

import torch

from embodichain.agents.rl.algo import BaseAlgorithm, register_algo
from embodichain.agents.rl.buffer import RolloutBuffer


@register_algo("my_algo")
class MyAlgorithm(BaseAlgorithm):
    def __init__(self, cfg, policy):
        self.cfg = cfg
        self.policy = policy
        self.device = torch.device(cfg.device)
        self.buffer = None

    def initialize_buffer(self, num_steps, num_envs, obs_dim, action_dim):
        """Initialize the algorithm's buffer."""
        self.buffer = RolloutBuffer(num_steps, num_envs, obs_dim, action_dim, self.device)

    def collect_rollout(self, env, policy, obs, num_steps, on_step_callback=None):
        """Control data collection (interact with env, fill buffer, compute advantages/returns)."""
        # Collect trajectories
        # Compute advantages/returns (e.g., GAE for on-policy algorithms)
        # Attach extras to buffer: self.buffer.set_extras({"advantages": adv, "returns": ret})
        # Return an empty dict (dense logging is handled in the trainer)
        return {}

    def update(self):
        """Update the policy using collected data."""
        # Access extras from buffer: self.buffer._extras.get("advantages")
        # Use self.buffer to update the policy
        return {"loss": 0.0}

Adding a New Policy#

To add a new policy:

  1. Create a new policy class inheriting from the Policy abstract base class

  2. Register in models/__init__.py:

from embodichain.agents.rl.models import register_policy, Policy


@register_policy("my_policy")
class MyPolicy(Policy):
    def __init__(self, obs_space, action_space, device, config):
        super().__init__()
        self.device = device
        # Initialize your networks here

    def get_action(self, obs, deterministic=False):
        ...

    def get_value(self, obs):
        ...

    def evaluate_actions(self, obs, actions):
        ...

Adding a New Environment#

To add a new RL environment:

  1. Create an environment class inheriting from EmbodiedEnv

  2. Register it with the Gymnasium registry:

from embodichain.lab.gym.utils.registration import register_env

@register_env("MyTaskRL", max_episode_steps=100, override=True)
class MyTaskEnv(EmbodiedEnv):
    cfg: MyTaskEnvCfg
    ...

  3. Use the environment ID in your JSON config:

"env": {
  "id": "MyTaskRL",
  "cfg": {
    ...
  }
}

Best Practices#

  • Device Management: The device is single-sourced from the device setting in the trainer config (e.g., “cuda:0”). All components (trainer/algorithm/policy/env) share the same device.

  • Action Scaling: Keep action scaling in the environment, not in the policy.

  • Observation Format: Environments should provide consistent observation shape/types (torch.float32) and a single done = terminated | truncated.

  • Algorithm Interface: Algorithms must implement initialize_buffer(), collect_rollout(), and update() methods. The algorithm completely controls data collection and buffer management.

  • Reward Components: Organize reward components in an info["rewards"] dictionary and metrics in an info["metrics"] dictionary. The trainer performs dense per-step logging directly from environment info (see the sketch after this list).

  • Configuration: Use JSON for all hyperparameters. This makes experiments reproducible and easy to track.

  • Logging: Metrics are automatically logged to TensorBoard and Weights & Biases. Check outputs/<exp_name>_<timestamp>/logs/ for TensorBoard logs.

  • Checkpoints: Regular checkpoints are saved to outputs/<exp_name>_<timestamp>/checkpoints/. Use these to resume training or evaluate policies.
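
The snippet below is a minimal sketch of what a conforming environment step might return, assuming the batched torch conventions above; the observation dimension and the keys inside "rewards"/"metrics" are illustrative, not part of the framework.

import torch

num_envs = 4
obs = torch.zeros(num_envs, 32, dtype=torch.float32)   # consistent shape and dtype
reward = torch.zeros(num_envs)
terminated = torch.zeros(num_envs, dtype=torch.bool)
truncated = torch.zeros(num_envs, dtype=torch.bool)
done = terminated | truncated                          # single done signal

info = {
    "rewards": {                                       # per-component reward terms (illustrative keys)
        "reach": torch.zeros(num_envs),
        "push": torch.zeros(num_envs),
    },
    "metrics": {                                       # task metrics (illustrative keys)
        "success": torch.zeros(num_envs, dtype=torch.bool),
    },
}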