.. _tutorial_data_generation:

Data Generation
===============

.. currentmodule:: embodichain.lab.gym

This tutorial shows how to generate synthetic expert demonstration datasets using EmbodiChain's built-in environment rollout and dataset manager. You will learn how to configure LeRobot recording in a gym config file (``.json``, ``.yaml``, or ``.yml``), how ``run_env.py`` builds an environment from configuration files, and how completed episodes are automatically saved to disk.

Overview
~~~~~~~~

EmbodiChain provides a built-in data generation workflow for imitation-learning and manipulation tasks:

- **Gym Configuration**: Describes the scene, robot, sensors, randomization events, observations, dataset recorder, and rollout settings.
- **Action Configuration**: Describes the task-specific expert action graph for tasks that use the action bank.
- **Environment Rollout**: Builds the environment directly from configuration files and executes offline generation.
- **Expert Policy**: Each task provides ``create_demo_action_list()`` or another scripted policy entry to generate expert actions.
- **Dataset Manager**: Records observation-action pairs during ``env.step()``.
- **LeRobotRecorder**: Converts completed episodes into LeRobot-compatible datasets, with optional video export.

What This Tutorial Records
--------------------------

This page documents the full path from task configuration to saved dataset:

1. Prepare a task gym config (e.g. ``gym_config.json`` or ``gym_config.yaml``).
2. Prepare an action config if the task uses the action bank (same supported extensions).
3. Launch the environment rollout with ``run-env``.
4. Let the dataset manager automatically save completed episodes.

Example Task
------------

As a concrete example, this tutorial uses a real action-bank task shipped in the repository:

- ``configs/gym/pour_water/gym_config.json`` defines the simulation scene and dataset recording behavior (YAML equivalents such as ``configs/gym/cobotmagic.yaml`` are also supported).
- ``configs/gym/pour_water/action_config.json`` defines the action-bank graph used to solve the task.

The Code
~~~~~~~~

The tutorial corresponds to the ``run_env.py`` script in ``embodichain/lab/scripts``.

.. dropdown:: Code for run_env.py
   :icon: code

   .. literalinclude:: ../../../embodichain/lab/scripts/run_env.py
      :language: python
      :linenos:


The Code Explained
~~~~~~~~~~~~~~~~~~

The rollout script builds the environment from configuration, generates expert trajectories, executes them step by step, and relies on the dataset manager to auto-save valid episodes.

Step 1: Prepare the Task Configuration
--------------------------------------

The first input to the pipeline is the task gym config file. In the example below, the same file contains rollout settings, scene randomization, observations, dataset recording, and robot or sensor definitions.

The rollout settings include the episode count:

.. literalinclude:: ../../../configs/gym/pour_water/gym_config.json
   :language: json
   :lines: 2-4

The dataset-related part looks like this:

.. literalinclude:: ../../../configs/gym/pour_water/gym_config.json
   :language: json
   :lines: 261-281

Important parameters are:

- **max_episodes**: Number of rollout episodes generated by ``run_env.py``.
- **max_episode_steps**: Maximum number of environment steps per episode.
- **dataset.lerobot.params.robot_meta**: Robot metadata such as robot type and control frequency.
- **dataset.lerobot.params.instruction**: Task language instruction stored together with the dataset.
- **dataset.lerobot.params.extra**: Additional metadata such as scene type and task description.
- **dataset.lerobot.params.use_videos**: Whether camera observations should be stored as videos.
- **env.control_parts**: Controlled robot parts in the environment.
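
The dotted names in the list above can be read as paths into the nested config. The following is a minimal sketch, assuming the gym config parses into plain nested dictionaries. The inline config and all of its values are placeholders rather than the contents of the shipped ``gym_config.json``, and ``get_by_path`` is a hypothetical helper, not an EmbodiChain function.

.. code-block:: python

   import json

   # Placeholder config: the nesting follows the dotted names above, but every
   # value is invented for illustration.
   CONFIG_TEXT = """
   {
     "dataset": {
       "lerobot": {
         "params": {
           "robot_meta": {"robot_type": "example_robot"},
           "instruction": "placeholder instruction text",
           "use_videos": true
         }
       }
     }
   }
   """


   def get_by_path(config, dotted_path, default=None):
       """Walk a nested dict along a dotted path; return default if a key is missing."""
       node = config
       for key in dotted_path.split("."):
           if not isinstance(node, dict) or key not in node:
               return default
           node = node[key]
       return node


   config = json.loads(CONFIG_TEXT)
   use_videos = get_by_path(config, "dataset.lerobot.params.use_videos")
   # "extra" is absent from the placeholder config, so the default is returned.
   extra = get_by_path(config, "dataset.lerobot.params.extra", default={})
   print(use_videos, extra)

A lookup like this can serve as a quick pre-flight check that the recorder settings are the ones you expect before starting a long generation run.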


In the current implementation, ``LeRobotRecorder`` stores robot state and action features following the official LeRobot format: ``observation.state`` for joint positions, ``action`` for applied actions, and ``observation.images.{sensor_name}`` for camera images.
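
As a small illustration of that naming scheme, the sketch below builds the expected feature keys for a set of cameras. The sensor names are hypothetical, and ``lerobot_feature_keys`` is an illustrative helper rather than part of ``LeRobotRecorder``.

.. code-block:: python

   def lerobot_feature_keys(sensor_names):
       """Return the feature keys described above for the given camera names."""
       keys = ["observation.state", "action"]
       keys += [f"observation.images.{name}" for name in sensor_names]
       return keys


   # Hypothetical camera names; use the sensor names from your own gym config.
   keys = lerobot_feature_keys(["cam_high", "cam_wrist"])
   print(keys)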

Step 2: Prepare the Action Configuration
----------------------------------------

For tasks that use the action bank, the second input is the action config file (``action_config.json`` in this example). This file defines the expert action graph consumed by ``create_demo_action_list()``. In the example below, the file is organized around ``scope``, ``node``, ``edge``, and ``sync``.

.. dropdown:: Action bank structure in the example task Pour_Water
   :icon: code

   **Scope Configuration**

   .. literalinclude:: ../../../configs/gym/pour_water/action_config.json
      :language: json
      :lines: 2-57

   **Node Configuration**

   .. literalinclude:: ../../../configs/gym/pour_water/action_config.json
      :language: json
      :lines: 96-177

   **Edge Configuration**

   .. literalinclude:: ../../../configs/gym/pour_water/action_config.json
      :language: json
      :lines: 763-790

   **Synchronization**

   .. literalinclude:: ../../../configs/gym/pour_water/action_config.json
      :language: json
      :lines: 906-932

This structure defines the expert rollout as follows:

- **Scope**: Defines controllable sub-graphs such as ``right_arm``, ``left_arm``, ``right_eef``, and ``left_eef``.
- **Node**: Defines key poses, targets computed from object affordances, and IK-generated joint targets.
- **Edge**: Defines executable transitions between nodes, including duration and execution function.
- **Sync**: Defines execution order rules between independently configured sub-actions.

Note: The action bank is not the only way to generate demonstrations. Depending on the task design, trajectories can also be produced by other scripted generation methods.
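
Before launching a long run, it can help to confirm that an action config contains all four parts described above. The following is a minimal sketch that assumes ``scope``, ``node``, ``edge``, and ``sync`` appear as top-level keys of the parsed file. That layout is an assumption made for illustration, and ``missing_sections`` is a hypothetical helper.

.. code-block:: python

   import json

   # Assumed top-level section names, taken from the description above.
   REQUIRED_SECTIONS = ("scope", "node", "edge", "sync")


   def missing_sections(action_config):
       """Return the required section names that are absent from the config."""
       return [name for name in REQUIRED_SECTIONS if name not in action_config]


   # Placeholder config with the "sync" section deliberately left out.
   example = json.loads('{"scope": {}, "node": {}, "edge": {}}')
   print(missing_sections(example))  # ['sync']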

Step 3: Launch the Environment Rollout
--------------------------------------

The rollout script parses command-line arguments, loads the gym and action config files, converts them into environment configuration objects, creates the environment instance, and then runs offline rollout for ``max_episodes`` episodes:

.. literalinclude:: ../../../embodichain/lab/scripts/run_env.py
   :language: python
   :start-at: def cli():
   :end-at:     main(args, env, gym_config)

Each rollout internally calls ``create_demo_action_list()``, validates the returned sequence, executes actions with ``env.step(action)``, and discards invalid rollouts by resetting with ``save_data=False``.
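
The control flow described above can be sketched with a stand-in environment. ``ToyEnv`` below is not the EmbodiChain API: its method signatures, the ``reset(save_data=...)`` keyword form, the validity check, and the choice to count discarded rollouts toward ``max_episodes`` are all assumptions made so that the loop runs on its own.

.. code-block:: python

   class ToyEnv:
       """Stand-in for a task environment (hypothetical, for illustration only)."""

       def __init__(self):
           self.saved_episodes = 0
           self._steps = 0
           self._calls = 0

       def create_demo_action_list(self):
           # Pretend that planning fails on every third call.
           self._calls += 1
           if self._calls % 3 == 0:
               return None
           return [[0.0, 0.1 * i] for i in range(5)]

       def step(self, action):
           # A real environment would simulate and record here.
           self._steps += 1

       def reset(self, save_data=True):
           # Count the buffered episode as saved only when asked to keep it.
           if save_data and self._steps > 0:
               self.saved_episodes += 1
           self._steps = 0


   def is_valid(action_list):
       """Treat a missing or empty action sequence as an invalid rollout."""
       return action_list is not None and len(action_list) > 0


   def generate(env, max_episodes):
       """Run the rollout loop and return (saved, discarded) episode counts."""
       discarded = 0
       for _ in range(max_episodes):
           action_list = env.create_demo_action_list()
           if not is_valid(action_list):
               env.reset(save_data=False)  # drop the failed rollout
               discarded += 1
               continue
           for action in action_list:
               env.step(action)
           env.reset(save_data=True)  # keep the completed episode
       return env.saved_episodes, discarded


   saved, discarded = generate(ToyEnv(), max_episodes=9)
   print(saved, discarded)  # 6 3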

The recommended CLI entrypoint is:

.. code-block:: bash

   python -m embodichain run-env \
       --gym_config configs/gym/pour_water/gym_config.json \
       --action_config configs/gym/pour_water/action_config.json \
       --headless

For interactive inspection, replace ``--headless`` with ``--preview``.
The script then opens the environment in an interactive debugging mode, which is intended for inspection and does not save datasets.


Useful CLI arguments:

- ``--gym_config``: Path to the task config file (``.json``, ``.yaml``, or ``.yml``).
- ``--action_config``: Path to the action-bank config file (``.json``, ``.yaml``, or ``.yml``).
- ``--num_envs``: Number of environments to run in parallel.
- ``--device``: Simulation device, such as ``cpu`` or ``cuda``.
- ``--headless``: Run without GUI for faster generation.
- ``--enable_rt``: Enable ray tracing for higher-quality visual observations.
- ``--preview``: Launch the environment in interactive preview mode.
- ``--filter_dataset_saving``: Disable dataset saving for debugging.

For the complete CLI argument list, see :doc:`CLI Reference </guides/cli>`.
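
When generation is driven from another script, the command can be assembled programmatically. This is a minimal sketch that uses only flags listed above. ``build_run_env_command`` is a hypothetical helper, and it assumes that ``--num_envs`` and ``--device`` each take a single value while ``--headless`` is a plain switch.

.. code-block:: python

   import shlex
   import sys


   def build_run_env_command(gym_config, action_config=None, headless=True,
                             num_envs=None, device=None):
       """Assemble the argument list for the run-env entrypoint."""
       cmd = [sys.executable, "-m", "embodichain", "run-env",
              "--gym_config", gym_config]
       if action_config is not None:
           cmd += ["--action_config", action_config]
       if num_envs is not None:
           cmd += ["--num_envs", str(num_envs)]
       if device is not None:
           cmd += ["--device", device]
       if headless:
           cmd.append("--headless")
       return cmd


   cmd = build_run_env_command(
       "configs/gym/pour_water/gym_config.json",
       action_config="configs/gym/pour_water/action_config.json",
       num_envs=4,
       device="cuda",
   )
   # Print the command for inspection; launching it is left to the caller,
   # for example with subprocess.run(cmd, check=True).
   print(shlex.join(cmd))

Building the command as a list rather than a single string avoids quoting problems when config paths contain spaces.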

Outputs
~~~~~~~

After successful execution, completed episodes are saved under the configured dataset root. If no explicit save path is provided and ``EMBODICHAIN_DATASET_ROOT`` is not set, ``LeRobotRecorder`` uses ``~/.cache/embodichain_datasets`` as the default dataset root.

A LeRobot dataset typically contains:

- **data/**: Recorded action and state data.
- **videos/**: Camera observations saved as videos when ``use_videos=True``.
- **meta/**: Dataset metadata such as task information and robot description.

Dataset folders are automatically numbered, which makes it easy to run repeated generations without overwriting previous results.
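
The dataset-root fallback and folder layout described in this section can be checked with a short script. This is a minimal sketch with two hypothetical helpers. It assumes that an explicit save path takes precedence over ``EMBODICHAIN_DATASET_ROOT``, and it does not attempt to reproduce the automatic folder numbering.

.. code-block:: python

   import os
   from pathlib import Path

   DEFAULT_ROOT = Path("~/.cache/embodichain_datasets")


   def resolve_dataset_root(save_path=None, environ=None):
       """Pick the dataset root: explicit path, then env var, then default."""
       environ = os.environ if environ is None else environ
       if save_path:
           return Path(save_path).expanduser()
       env_root = environ.get("EMBODICHAIN_DATASET_ROOT")
       if env_root:
           return Path(env_root).expanduser()
       return DEFAULT_ROOT.expanduser()


   def missing_subfolders(dataset_dir, use_videos=True):
       """Return the expected subfolders that are absent from a dataset folder."""
       expected = ["data", "meta"] + (["videos"] if use_videos else [])
       return [name for name in expected if not (Path(dataset_dir) / name).is_dir()]


   root = resolve_dataset_root()
   print(root)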

In a practical workflow, the output of this stage is the synthesized dataset itself. Later training scripts typically consume these saved LeRobot episodes instead of regenerating trajectories each time.

Best Practices
~~~~~~~~~~~~~~

- **Keep the config pair together**: Version gym and action configs together for action-bank tasks (either JSON or YAML).
- **Use valid scripted policies**: Make sure ``create_demo_action_list()`` returns executable trajectories for the current scene.
- **Run headless for throughput**: Pass ``--headless`` to disable the GUI when generating large datasets.
- **Debug without writing data**: Use ``--preview`` and ``--filter_dataset_saving`` to inspect task logic without writing datasets.
- **Discard invalid rollouts**: Keep the default validation logic so failed trajectories are not saved.
