Reinforcement Learning Training#

This tutorial shows you how to train reinforcement learning agents using EmbodiChain’s RL framework. You’ll learn how to configure training via JSON, set up environments, policies, and algorithms, and launch training sessions.

Overview#

The RL framework provides a modular, extensible stack for robotics tasks:

  • Trainer: Orchestrates the training loop (calls algorithm for data collection and updates, handles logging/eval/save)

  • Algorithm: Controls data collection process (interacts with environment, fills buffer, computes advantages/returns) and updates the policy (e.g., PPO)

  • Policy: Neural network models implementing a unified interface (get_action/get_value/evaluate_actions)

  • Buffer: On-policy rollout storage and minibatch iterator (managed by algorithm)

  • Env Factory: Build environments from a JSON config via registry

Architecture#

The framework follows a clean separation of concerns. The core components and their relationships:

  • Trainer → Policy, Env, Algorithm (via callbacks for statistics)

  • Algorithm → Policy, RolloutBuffer (algorithm manages its own buffer)
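
The Trainer/Algorithm split above can be sketched as a toy loop. This is a schematic only: the stub classes and names below are illustrative, not the actual Trainer or Algorithm implementations.

```python
# Toy sketch of the Trainer -> Algorithm -> Buffer control flow.
# ToyAlgorithm and toy_train are illustrative stand-ins.
class ToyAlgorithm:
    def collect(self, env_step, buffer_size):
        # "data collection": fill a rollout buffer by stepping the env
        return [env_step() for _ in range(buffer_size)]

    def update(self, rollout):
        # "policy update": here we just report the mean reward as a stat
        return {"mean_reward": sum(rollout) / len(rollout)}


def toy_train(env_step, algorithm, iterations, buffer_size):
    # Trainer role: orchestrate collect/update and gather statistics
    stats = []
    for _ in range(iterations):
        rollout = algorithm.collect(env_step, buffer_size)
        stats.append(algorithm.update(rollout))
    return stats


stats = toy_train(lambda: 1.0, ToyAlgorithm(), iterations=3, buffer_size=4)
```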

Configuration via JSON#

Training is configured via a JSON file that defines runtime settings, environment, policy, and algorithm parameters.

Example Configuration#

The configuration file (e.g., train_config.json) is located in configs/agents/rl/push_cube:

Example: train_config.json
{
    "trainer": {
        "exp_name": "push_cube_ppo",
        "gym_config": "configs/agents/rl/push_cube/gym_config.json",
        "seed": 42,
        "device": "cuda:0",
        "headless": true,
        "enable_rt": false,
        "gpu_id": 0,
        "num_envs": 64,
        "iterations": 1000,
        "buffer_size": 1024,
        "enable_eval": true,
        "num_eval_envs": 16,
        "num_eval_episodes": 3,
        "eval_freq": 100,
        "save_freq": 100,
        "use_wandb": true,
        "wandb_project_name": "embodichain-push_cube",
        "events": {
            "eval": {
                "record_camera": {
                    "func": "record_camera_data_async",
                    "mode": "interval",
                    "interval_step": 1,
                    "params": {
                        "name": "main_cam",
                        "resolution": [640, 480],
                        "eye": [-1.4, 1.4, 2.0],
                        "target": [0, 0, 0],
                        "up": [0, 0, 1],
                        "intrinsics": [600, 600, 320, 240],
                        "save_path": "./outputs/videos_ppo1/eval"
                    }
                }
            }
        }
    },
    "policy": {
        "name": "actor_critic",
        "actor": {
            "type": "mlp",
            "network_cfg": {
                "hidden_sizes": [256, 256],
                "activation": "relu"
            }
        },
        "critic": {
            "type": "mlp",
            "network_cfg": {
                "hidden_sizes": [256, 256],
                "activation": "relu"
            }
        }
    },
    "algorithm": {
        "name": "ppo",
        "cfg": {
            "learning_rate": 0.0001,
            "n_epochs": 10,
            "batch_size": 8192,
            "gamma": 0.99,
            "gae_lambda": 0.95,
            "clip_coef": 0.2,
            "ent_coef": 0.01,
            "vf_coef": 0.5,
            "max_grad_norm": 0.5
        }
    }
}

Configuration Sections#

Runtime Settings#

The trainer section controls experiment setup:

  • exp_name: Experiment name (used for output directories)

  • seed: Random seed for reproducibility

  • device: Runtime device string, e.g. "cpu" or "cuda:0"

  • headless: Whether to run simulation in headless mode

  • iterations: Number of training iterations

  • buffer_size: Steps collected per rollout (e.g., 1024)

  • eval_freq: Frequency of evaluation (in steps)

  • save_freq: Frequency of checkpoint saving (in steps)

  • use_wandb: Whether to enable Weights & Biases logging (set in JSON config)

  • wandb_project_name: Weights & Biases project name

Environment Configuration#

The env section defines the task environment:

  • id: Environment registry ID (e.g., "PushCubeRL")

  • cfg: Environment-specific configuration parameters

For RL environments, use the actions field for action preprocessing and extensions for task-specific parameters:

  • actions: Action Manager config (e.g., DeltaQposTerm with scale)

  • extensions: Task-specific parameters (e.g., success_threshold)

Example:

"env": {
  "id": "PushCubeRL",
  "cfg": {
    "num_envs": 4,
    "actions": {
      "delta_qpos": {
        "func": "DeltaQposTerm",
        "params": { "scale": 0.1 }
      }
    },
    "extensions": {
      "success_threshold": 0.1
    }
  }
}

Policy Configuration#

The policy section defines the neural network policy:

  • name: Policy name (e.g., "actor_critic", "vla")

  • action_dim: Optional policy output action dimension. If omitted, it is inferred from the environment's action dimension (the Action Manager's total_action_dim).

  • actor: Actor network configuration (required for actor_critic)

  • critic: Critic network configuration (required for actor_critic)

Example:

"policy": {
  "name": "actor_critic",
  "actor": {
    "type": "mlp",
    "network_cfg": {
      "hidden_sizes": [256, 256],
      "activation": "relu"
    }
  },
  "critic": {
    "type": "mlp",
    "network_cfg": {
      "hidden_sizes": [256, 256],
      "activation": "relu"
    }
  }
}

Algorithm Configuration#

The algorithm section defines the RL algorithm:

  • name: Algorithm name (e.g., "ppo", "grpo")

  • cfg: Algorithm-specific hyperparameters

PPO example:

"algorithm": {
  "name": "ppo",
  "cfg": {
    "learning_rate": 0.0001,
    "n_epochs": 10,
    "batch_size": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_coef": 0.2,
    "ent_coef": 0.01,
    "vf_coef": 0.5,
    "max_grad_norm": 0.5
  }
}

GRPO example (for Embodied AI / from-scratch training, e.g. CartPole):

"algorithm": {
  "name": "grpo",
  "cfg": {
    "learning_rate": 0.0001,
    "n_epochs": 10,
    "batch_size": 8192,
    "gamma": 0.99,
    "clip_coef": 0.2,
    "ent_coef": 0.001,
    "kl_coef": 0,
    "group_size": 4,
    "eps": 1e-8,
    "reset_every_rollout": true,
    "max_grad_norm": 0.5,
    "truncate_at_first_done": true
  }
}

For GRPO, use the actor_only policy. Set kl_coef=0 for from-scratch training and kl_coef=0.02 for VLA/LLM fine-tuning.
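
The core idea behind GRPO's advantage estimation, normalizing each step's returns against its group of parallel environments, can be sketched as follows. This is an illustrative sketch, not EmbodiChain's implementation; the group_size and eps parameters follow the cfg above.

```python
import torch

def group_relative_advantages(returns, group_size=4, eps=1e-8):
    # Group-relative advantages: normalize per-step returns against the
    # mean/std of each group of environments (no critic needed).
    num_envs, horizon = returns.shape
    groups = returns.view(num_envs // group_size, group_size, horizon)
    mean = groups.mean(dim=1, keepdim=True)
    std = groups.std(dim=1, keepdim=True)
    adv = (groups - mean) / (std + eps)
    return adv.view(num_envs, horizon)

# Four envs in one group: below-mean returns get negative advantages
returns = torch.tensor([[1.0], [3.0], [1.0], [3.0]])
adv = group_relative_advantages(returns, group_size=4)
```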

Training Script#

The training script (train.py) is located in embodichain/agents/rl/:

Code for train.py
# ----------------------------------------------------------------------------
# Copyright (c) 2021-2026 DexForce Technology Co., Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ----------------------------------------------------------------------------

import argparse
import os
import time
from pathlib import Path

import numpy as np
import torch
import wandb
import json
from torch.utils.tensorboard import SummaryWriter
from copy import deepcopy

from embodichain.agents.rl.models import build_policy, get_registered_policy_names
from embodichain.agents.rl.models import build_mlp_from_cfg
from embodichain.agents.rl.algo import build_algo, get_registered_algo_names
from embodichain.agents.rl.utils import dict_to_tensordict, flatten_dict_observation
from embodichain.agents.rl.utils.trainer import Trainer
from embodichain.utils import logger
from embodichain.lab.gym.envs.tasks.rl import build_env
from embodichain.lab.gym.utils.gym_utils import config_to_cfg, DEFAULT_MANAGER_MODULES
from embodichain.utils.utility import load_json
from embodichain.utils.module_utils import find_function_from_modules
from embodichain.lab.sim import SimulationManagerCfg
from embodichain.lab.gym.envs.managers.cfg import EventCfg


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True, help="Path to JSON config")
    parser.add_argument(
        "--distributed",
        action=argparse.BooleanOptionalAction,
        default=None,
        help="Enable or disable multi-GPU distributed training",
    )
    return parser.parse_args()


def train_from_config(config_path: str, distributed: bool | None = None):
    """Run training from a config file path.

    Args:
        config_path: Path to the JSON config file
        distributed: If True, run multi-GPU distributed training.
            If None, use trainer.distributed from config.
    """
    with open(config_path, "r") as f:
        cfg_json = json.load(f)

    trainer_cfg = cfg_json["trainer"]
    policy_block = cfg_json["policy"]
    algo_block = cfg_json["algorithm"]

    # Resolve distributed flag
    if distributed is None:
        distributed = bool(trainer_cfg.get("distributed", False))

    # Distributed setup
    rank = 0
    world_size = 1
    local_rank = 0
    if distributed:
        if not torch.distributed.is_available():
            raise RuntimeError(
                "Distributed training requested but torch.distributed is not available."
            )
        if not torch.cuda.is_available():
            raise RuntimeError(
                "Distributed training with NCCL backend requires CUDA, "
                "but torch.cuda.is_available() is False."
            )
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if local_rank < 0 or local_rank >= torch.cuda.device_count():
            raise ValueError(
                f"LOCAL_RANK {local_rank} is out of range "
                f"(available GPUs: {torch.cuda.device_count()})."
            )
        torch.cuda.set_device(local_rank)
        if not torch.distributed.is_initialized():
            torch.distributed.init_process_group(backend="nccl")
        rank = torch.distributed.get_rank()
        world_size = torch.distributed.get_world_size()

    # Runtime
    exp_name = trainer_cfg.get("exp_name", "generic_exp")
    seed = int(trainer_cfg.get("seed", 1))
    device_str = trainer_cfg.get("device", "cpu")
    if distributed:
        device_str = f"cuda:{local_rank}"
    iterations = int(trainer_cfg.get("iterations", 250))
    buffer_size = int(
        trainer_cfg.get("buffer_size", trainer_cfg.get("rollout_steps", 2048))
    )
    enable_eval = bool(trainer_cfg.get("enable_eval", False))
    eval_freq = int(trainer_cfg.get("eval_freq", 10000))
    save_freq = int(trainer_cfg.get("save_freq", 50000))
    num_eval_episodes = int(trainer_cfg.get("num_eval_episodes", 5))
    headless = bool(trainer_cfg.get("headless", True))
    enable_rt = bool(trainer_cfg.get("enable_rt", False))
    gpu_id = int(trainer_cfg.get("gpu_id", 0))
    num_envs = trainer_cfg.get("num_envs", None)
    wandb_project_name = trainer_cfg.get("wandb_project_name", "embodichain-generic")

    # Device
    if not isinstance(device_str, str):
        raise ValueError(
            f"runtime.device must be a string such as 'cpu' or 'cuda:0'. Got: {device_str!r}"
        )
    try:
        device = torch.device(device_str)
    except RuntimeError as exc:
        raise ValueError(
            f"Failed to parse runtime.device='{device_str}': {exc}"
        ) from exc

    if device.type == "cuda":
        if not torch.cuda.is_available():
            raise ValueError(
                "CUDA device requested but torch.cuda.is_available() is False."
            )
        index = (
            device.index if device.index is not None else torch.cuda.current_device()
        )
        device_count = torch.cuda.device_count()
        if index < 0 or index >= device_count:
            raise ValueError(
                f"CUDA device index {index} is out of range (available devices: {device_count})."
            )
        torch.cuda.set_device(index)
        device = torch.device(f"cuda:{index}")
    elif device.type != "cpu":
        raise ValueError(f"Unsupported device type: {device}")
    if rank == 0:
        logger.log_info(f"Device: {device}")
    if distributed and rank == 0:
        logger.log_info(f"Distributed training: world_size={world_size}")

    # Seeds
    effective_seed = seed + rank
    np.random.seed(effective_seed)
    torch.manual_seed(effective_seed)
    torch.backends.cudnn.deterministic = True
    if device.type == "cuda":
        torch.cuda.manual_seed_all(effective_seed)

    # Outputs
    if distributed:
        run_stamp = time.strftime("%Y%m%d_%H%M%S") if rank == 0 else None
        run_stamp_list = [run_stamp]
        torch.distributed.broadcast_object_list(run_stamp_list, src=0)
        run_stamp = run_stamp_list[0]
    else:
        run_stamp = time.strftime("%Y%m%d_%H%M%S")
    run_base = os.path.join("outputs", f"{exp_name}_{run_stamp}")
    log_dir = os.path.join(run_base, "logs")
    checkpoint_dir = os.path.join(run_base, "checkpoints")
    if rank == 0:
        os.makedirs(log_dir, exist_ok=True)
        os.makedirs(checkpoint_dir, exist_ok=True)
    writer = SummaryWriter(f"{log_dir}/{exp_name}") if rank == 0 else None

    # Initialize Weights & Biases (optional)
    use_wandb = trainer_cfg.get("use_wandb", False)
    if use_wandb and rank == 0:
        wandb.init(project=wandb_project_name, name=exp_name, config=cfg_json)

    gym_config_path = Path(trainer_cfg["gym_config"])
    if rank == 0:
        logger.log_info(f"Current working directory: {Path.cwd()}")

    gym_config_data = load_json(str(gym_config_path))
    gym_env_cfg = config_to_cfg(
        gym_config_data, manager_modules=DEFAULT_MANAGER_MODULES
    )
    if num_envs is not None:
        gym_env_cfg.num_envs = int(num_envs)

    # Ensure sim configuration mirrors runtime overrides
    if gym_env_cfg.sim_cfg is None:
        gym_env_cfg.sim_cfg = SimulationManagerCfg()
    if device.type == "cuda":
        gpu_index = device.index
        if gpu_index is None:
            gpu_index = torch.cuda.current_device()
        gym_env_cfg.sim_cfg.sim_device = torch.device(f"cuda:{gpu_index}")
        if hasattr(gym_env_cfg.sim_cfg, "gpu_id"):
            gym_env_cfg.sim_cfg.gpu_id = gpu_index
    else:
        gym_env_cfg.sim_cfg.sim_device = torch.device("cpu")
    gym_env_cfg.sim_cfg.headless = headless
    gym_env_cfg.sim_cfg.enable_rt = enable_rt
    gym_env_cfg.sim_cfg.gpu_id = local_rank if distributed else gpu_id

    if rank == 0:
        logger.log_info(
            f"Loaded gym_config from {gym_config_path} (env_id={gym_config_data['id']}, num_envs={gym_env_cfg.num_envs}, headless={gym_env_cfg.sim_cfg.headless}, enable_rt={gym_env_cfg.sim_cfg.enable_rt}, sim_device={gym_env_cfg.sim_cfg.sim_device})"
        )

    env = build_env(gym_config_data["id"], base_env_cfg=gym_env_cfg)
    sample_obs, _ = env.reset()
    sample_obs_td = dict_to_tensordict(sample_obs, device)
    obs_dim = flatten_dict_observation(sample_obs_td).shape[-1]
    flat_obs_space = env.flattened_observation_space

    # Create evaluation environment only if enabled
    eval_env = None
    num_eval_envs = trainer_cfg.get("num_eval_envs", 4)
    if enable_eval and rank == 0:
        eval_gym_env_cfg = deepcopy(gym_env_cfg)
        eval_gym_env_cfg.num_envs = num_eval_envs
        eval_gym_env_cfg.sim_cfg.headless = True
        eval_env = build_env(gym_config_data["id"], base_env_cfg=eval_gym_env_cfg)
        logger.log_info(
            f"Evaluation environment created (num_envs={num_eval_envs}, headless=True)"
        )

    # Build Policy via registry
    policy_name = policy_block["name"]
    env_action_dim = (
        env.get_wrapper_attr("action_manager").total_action_dim
        if env.get_wrapper_attr("action_manager") is not None
        else len(env.get_wrapper_attr("active_joint_ids"))
    )
    action_dim = policy_block.get("action_dim", env_action_dim)
    action_dim = int(action_dim)
    if action_dim != env_action_dim:
        raise ValueError(
            f"Configured policy.action_dim={action_dim} does not match env action dim {env_action_dim}."
        )
    # Build Policy via registry (actor/critic must be explicitly defined in JSON when using actor_critic/actor_only)
    if policy_name.lower() == "actor_critic":
        actor_cfg = policy_block.get("actor")
        critic_cfg = policy_block.get("critic")
        if actor_cfg is None or critic_cfg is None:
            raise ValueError(
                "ActorCritic requires 'actor' and 'critic' definitions in JSON (policy.actor / policy.critic)."
            )

        actor = build_mlp_from_cfg(actor_cfg, obs_dim, action_dim)
        critic = build_mlp_from_cfg(critic_cfg, obs_dim, 1)

        policy = build_policy(
            policy_block,
            flat_obs_space,
            env.action_space,
            device,
            actor=actor,
            critic=critic,
        )
    elif policy_name.lower() == "actor_only":
        actor_cfg = policy_block.get("actor")
        if actor_cfg is None:
            raise ValueError(
                "ActorOnly requires 'actor' definition in JSON (policy.actor)."
            )

        actor = build_mlp_from_cfg(actor_cfg, obs_dim, action_dim)

        policy = build_policy(
            policy_block,
            flat_obs_space,
            env.action_space,
            device,
            actor=actor,
        )
    else:
        policy = build_policy(
            policy_block, env.observation_space, env.action_space, device
        )

    # Build Algorithm via factory
    algo_name = algo_block["name"].lower()
    algo_cfg = algo_block["cfg"]
    algo = build_algo(
        algo_name,
        algo_cfg,
        policy,
        device,
        distributed=distributed,
    )

    # Build Trainer
    event_modules = [
        "embodichain.lab.gym.envs.managers.randomization",
        "embodichain.lab.gym.envs.managers.record",
        "embodichain.lab.gym.envs.managers.events",
    ]
    events_dict = trainer_cfg.get("events", {})
    train_event_cfg = {}
    eval_event_cfg = {}
    # Parse train events
    for event_name, event_info in events_dict.get("train", {}).items():
        event_func_str = event_info.get("func")
        mode = event_info.get("mode", "interval")
        params = event_info.get("params", {})
        interval_step = event_info.get("interval_step", 1)
        event_func = find_function_from_modules(
            event_func_str, event_modules, raise_if_not_found=True
        )
        train_event_cfg[event_name] = EventCfg(
            func=event_func,
            mode=mode,
            params=params,
            interval_step=interval_step,
        )
    # Parse eval events (only if evaluation is enabled)
    if enable_eval:
        for event_name, event_info in events_dict.get("eval", {}).items():
            event_func_str = event_info.get("func")
            mode = event_info.get("mode", "interval")
            params = event_info.get("params", {})
            interval_step = event_info.get("interval_step", 1)
            event_func = find_function_from_modules(
                event_func_str, event_modules, raise_if_not_found=True
            )
            eval_event_cfg[event_name] = EventCfg(
                func=event_func,
                mode=mode,
                params=params,
                interval_step=interval_step,
            )
    trainer = Trainer(
        policy=policy,
        env=env,
        algorithm=algo,
        buffer_size=buffer_size,
        batch_size=algo_cfg["batch_size"],
        writer=writer,
        eval_freq=eval_freq if enable_eval else 0,  # Disable eval if not enabled
        save_freq=save_freq,
        checkpoint_dir=checkpoint_dir,
        exp_name=exp_name,
        use_wandb=use_wandb,
        eval_env=eval_env,  # None if enable_eval=False
        event_cfg=train_event_cfg,
        eval_event_cfg=eval_event_cfg if (enable_eval and rank == 0) else {},
        num_eval_episodes=num_eval_episodes,
        distributed=distributed,
        rank=rank,
        world_size=world_size,
    )

    if rank == 0:
        logger.log_info("Generic training initialized")
        logger.log_info(f"Task: {type(env).__name__}")
        logger.log_info(
            f"Policy: {policy_name} (available: {get_registered_policy_names()})"
        )
        logger.log_info(
            f"Algorithm: {algo_name} (available: {get_registered_algo_names()})"
        )

    total_steps = int(iterations * buffer_size * env.num_envs * world_size)
    if rank == 0:
        logger.log_info(
            f"Total steps: {total_steps} (iterations≈{iterations}, world_size={world_size})"
        )

    try:
        trainer.train(total_steps)
    except KeyboardInterrupt:
        if rank == 0:
            logger.log_info("Training interrupted by user")
    finally:
        trainer.save_checkpoint()
        if writer is not None:
            writer.close()
        if use_wandb and rank == 0:
            try:
                wandb.finish()
            except Exception:
                pass

        # Clean up environments to prevent resource leaks
        try:
            if env is not None:
                env.close()
        except Exception as e:
            if rank == 0:
                logger.log_warning(f"Failed to close training environment: {e}")

        try:
            if eval_env is not None:
                eval_env.close()
        except Exception as e:
            if rank == 0:
                logger.log_warning(f"Failed to close evaluation environment: {e}")

        if distributed and torch.distributed.is_initialized():
            torch.distributed.destroy_process_group()

        if rank == 0:
            logger.log_info("Training finished")


def main():
    """Main entry point for command-line training."""
    args = parse_args()
    train_from_config(args.config, distributed=args.distributed)


if __name__ == "__main__":
    main()

The Script Explained#

The training script performs the following steps:

  1. Parse Configuration: Loads JSON config and extracts runtime/env/policy/algorithm blocks

  2. Setup: Initializes device, seeds, output directories, TensorBoard, and Weights & Biases

  3. Build Components: the environment via the build_env() factory, the policy via the build_policy() registry, and the algorithm via the build_algo() factory

  4. Create Trainer: Instantiates the Trainer with all components

  5. Train: Runs the training loop until completion

Launching Training#

To start training, run:

python -m embodichain.agents.rl.train --config configs/agents/rl/push_cube/train_config.json
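
For multi-GPU training, the script's --distributed flag and LOCAL_RANK handling are compatible with a torchrun launch. The sketch below assumes two GPUs on one machine; adjust the process count for your hardware:

```shell
# One process per GPU; torchrun sets LOCAL_RANK for each process,
# and the script initializes the NCCL process group accordingly.
torchrun --nproc_per_node=2 -m embodichain.agents.rl.train \
    --config configs/agents/rl/push_cube/train_config.json \
    --distributed
```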

Outputs#

All outputs are written to ./outputs/<exp_name>_<timestamp>/:

  • logs/: TensorBoard logs

  • checkpoints/: Model checkpoints

Training Process#

The training process follows this sequence:

  1. Rollout Phase: SyncCollector interacts with the environment and writes policy-side fields into a shared rollout TensorDict with uniform [N, T + 1] layout. EmbodiedEnv writes environment-side step fields such as reward, done, terminated, and truncated into the same rollout via set_rollout_buffer(). The final slot of transition-only fields is reserved as padding, while obs[:, -1] and value[:, -1] remain valid bootstrap data.

  2. Advantage/Return Computation: Algorithm computes advantages and returns from the collected rollout (e.g. GAE for PPO, step-wise group normalization for GRPO) and converts it to a transition-aligned view over the valid first T steps before minibatch optimization.

  3. Update Phase: Algorithm updates the policy with update(rollout)

  4. Logging: Trainer logs training losses and aggregated metrics to TensorBoard and Weights & Biases

  5. Evaluation (periodic): Trainer evaluates the current policy

  6. Checkpointing (periodic): Trainer saves model checkpoints
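
For PPO, the advantage/return computation in step 2 is GAE over the [N, T + 1] value layout described above. The following is a minimal self-contained sketch of the recursion (illustrative, not EmbodiChain's implementation):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    # rewards, dones: [N, T]; values: [N, T + 1] (last column is the bootstrap value)
    num_envs, horizon = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_adv = torch.zeros(num_envs)
    for t in reversed(range(horizon)):
        not_done = 1.0 - dones[:, t]
        # TD residual, bootstrapping from the next value unless the episode ended
        delta = rewards[:, t] + gamma * values[:, t + 1] * not_done - values[:, t]
        last_adv = delta + gamma * gae_lambda * not_done * last_adv
        advantages[:, t] = last_adv
    returns = advantages + values[:, :horizon]
    return advantages, returns

# With gamma = lambda = 1 and zero values, advantages are reverse cumulative rewards
rewards = torch.ones(1, 3)
values = torch.zeros(1, 4)
dones = torch.zeros(1, 3)
adv, ret = compute_gae(rewards, values, dones, gamma=1.0, gae_lambda=1.0)
```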

Policy Interface#

All policies must inherit from the Policy abstract base class:

from abc import ABC, abstractmethod
import torch
import torch.nn as nn

class Policy(nn.Module, ABC):
    device: torch.device

    def get_action(self, tensordict, deterministic: bool = False):
        """Writes sampled action, sample_log_prob, and value into the TensorDict."""
        ...

    @abstractmethod
    def forward(self, tensordict, deterministic: bool = False):
        """Writes action, sample_log_prob, and value into the TensorDict."""
        raise NotImplementedError

    @abstractmethod
    def get_value(self, tensordict):
        """Writes value estimate into the TensorDict."""
        raise NotImplementedError

    @abstractmethod
    def evaluate_actions(self, tensordict):
        """Returns a new TensorDict with log_prob, entropy, and value."""
        raise NotImplementedError

Available Policies#

  • ActorCritic: MLP-based Gaussian policy with learnable log_std. Requires external actor and critic modules to be provided (defined in JSON config). Used with PPO.

  • ActorOnly: Actor-only policy without Critic. Used with GRPO (group-relative advantage estimation).

  • VLAPlaceholderPolicy: Placeholder for Vision-Language-Action policies

Algorithms#

Available Algorithms#

  • PPO: Proximal Policy Optimization with GAE

  • GRPO: Group Relative Policy Optimization (no Critic, step-wise returns, masked group normalization). Use actor_only policy. Set kl_coef=0 for from-scratch training (CartPole, dense reward); kl_coef=0.02 for VLA/LLM fine-tuning.
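
At the core of the PPO update is the clipped surrogate objective controlled by clip_coef. A minimal sketch of that loss (illustrative, not EmbodiChain's code):

```python
import torch

def ppo_clip_loss(log_prob, old_log_prob, advantages, clip_coef=0.2):
    # Probability ratio between the current policy and the rollout-time policy
    ratio = (log_prob - old_log_prob).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    # Pessimistic (elementwise min) bound, negated for gradient descent
    return -torch.min(unclipped, clipped).mean()

# When the policy has not moved (ratio == 1), the loss is just -mean(advantages)
loss = ppo_clip_loss(
    log_prob=torch.tensor([0.0, 0.0]),
    old_log_prob=torch.tensor([0.0, 0.0]),
    advantages=torch.tensor([1.0, 2.0]),
)
```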

Adding a New Algorithm#

To add a new algorithm:

  1. Create a new algorithm class in embodichain/agents/rl/algo/

  2. Implement update(rollout) and consume the shared rollout TensorDict

  3. Register in algo/__init__.py:

import torch
from tensordict import TensorDict
from embodichain.agents.rl.algo import BaseAlgorithm, register_algo

@register_algo("my_algo")
class MyAlgorithm(BaseAlgorithm):
    def __init__(self, cfg, policy, device):
        self.cfg = cfg
        self.policy = policy
        self.device = torch.device(device)

    def update(self, rollout: TensorDict):
        """Update the policy using a collected rollout."""
        # compute advantages / returns from rollout
        # optimize policy parameters
        return {"loss": 0.0}

Adding a New Policy#

To add a new policy:

  1. Create a new policy class inheriting from the Policy abstract base class

  2. Register in models/__init__.py:

from embodichain.agents.rl.models import register_policy, Policy

@register_policy("my_policy")
class MyPolicy(Policy):
    def __init__(self, obs_dim, action_dim, device, config):
        super().__init__()
        self.device = device
        # Initialize your networks here

    def get_action(self, tensordict, deterministic=False):
        ...
    def forward(self, tensordict, deterministic=False):
        ...
    def get_value(self, tensordict):
        ...
    def evaluate_actions(self, tensordict):
        ...

Current built-in MLP policies use flattened observations in the training path. If your policy requires structured or multi-modal inputs, keep the richer obs_space interface and define a matching rollout/collector schema.
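
Flattening a dict observation can look like the following hypothetical helper (the framework's flatten_dict_observation may differ in ordering and handling of nested keys):

```python
import torch

def flatten_obs(obs: dict) -> torch.Tensor:
    # Concatenate all leaf tensors along the feature dim, in sorted key order,
    # preserving the leading batch (num_envs) dimension.
    parts = [obs[key].reshape(obs[key].shape[0], -1) for key in sorted(obs)]
    return torch.cat(parts, dim=-1)

# Two envs, a 7-dim joint state and a 3-dim object position -> 10 features
obs = {"qpos": torch.zeros(2, 7), "cube_pos": torch.zeros(2, 3)}
flat = flatten_obs(obs)
```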

Adding a New Environment#

To add a new RL environment:

  1. Create an environment class inheriting from EmbodiedEnv (with Action Manager configured for action preprocessing and standardized info structure):

from embodichain.lab.gym.envs import EmbodiedEnv, EmbodiedEnvCfg
from embodichain.lab.gym.utils.registration import register_env
import torch

@register_env("MyTaskRL", override=True)
class MyTaskEnv(EmbodiedEnv):
    def __init__(self, cfg: EmbodiedEnvCfg = None, **kwargs):
        super().__init__(cfg, **kwargs)

    def compute_task_state(self, **kwargs):
        """Compute success/failure conditions and metrics."""
        is_success = ...  # Define success condition
        is_fail = torch.zeros_like(is_success)
        metrics = {"distance": ..., "error": ...}
        return is_success, is_fail, metrics

  2. Configure the environment in your JSON config with actions and extensions:

"env": {
  "id": "MyTaskRL",
  "cfg": {
    "num_envs": 4,
    "actions": {
      "delta_qpos": {
        "func": "DeltaQposTerm",
        "params": { "scale": 0.1 }
      }
    },
    "extensions": {
      "success_threshold": 0.05
    }
  }
}

The EmbodiedEnv with Action Manager provides:

  • Action Preprocessing: Configurable via actions (DeltaQposTerm, QposTerm, EefPoseTerm, etc.)

  • Standardized Info: Implements get_info() using compute_task_state() template method

Best Practices#

  • Use EmbodiedEnv with Action Manager for RL Tasks: Inherit from EmbodiedEnv and configure actions in your config. The Action Manager handles action preprocessing (delta_qpos, qpos, qvel, qf, eef_pose) in a modular way.

  • Action Configuration: Use the actions field in your JSON config. Example: "delta_qpos": {"func": "DeltaQposTerm", "params": {"scale": 0.1}}.

  • Device Management: Device is single-sourced from the device setting in the trainer config (e.g., "cuda:0"). All components (trainer/algorithm/policy/env) share the same device.

  • Observation Format: Environments should provide consistent observation shape/types (torch.float32) and a single done = terminated | truncated.

  • Algorithm Interface: Algorithms implement update(rollout) and consume a shared rollout TensorDict. Collection is handled by SyncCollector plus environment-side rollout writes in EmbodiedEnv.

  • Reward Configuration: Use the RewardManager in your environment config to define reward components. Organize reward components in info["rewards"] dictionary and metrics in info["metrics"] dictionary. The trainer performs dense per-step logging directly from environment info.

  • Template Methods: Override compute_task_state() to define success/failure conditions and metrics. Override check_truncated() for custom truncation logic.

  • Configuration: Use JSON for all hyperparameters. This makes experiments reproducible and easy to track.

  • Logging: Metrics are automatically logged to TensorBoard and Weights & Biases. Check outputs/<exp_name>_<timestamp>/logs/ for TensorBoard logs.

  • Checkpoints: Regular checkpoints are saved to outputs/<exp_name>_<timestamp>/checkpoints/. Use these to resume training or evaluate policies.