# Rollout Buffer
This module implements the data buffer for RL training, responsible for storing trajectory data from agent-environment interactions.
## Main Classes and Structure
### RolloutBuffer
Used by on-policy algorithms (such as PPO); efficiently stores observations, actions, rewards, dones, values, and logprobs for each step.
Supports multi-environment parallelism (shape: `[T, N, ...]`); all data is allocated on the GPU.
Structure fields:
- `obs`: observation tensor, float32, shape `[T, N, obs_dim]`
- `actions`: action tensor, float32, shape `[T, N, action_dim]`
- `rewards`: reward tensor, float32, shape `[T, N]`
- `dones`: done flags, bool, shape `[T, N]`
- `values`: value estimates, float32, shape `[T, N]`
- `logprobs`: action log probabilities, float32, shape `[T, N]`
- `_extras`: algorithm-specific fields (e.g., advantages, returns), `dict[str, Tensor]`
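For instance, with hypothetical sizes (T=128 steps, N=16 parallel envs, 8-dim observations, 2-dim actions) on a CUDA device, the allocations look like this:

```python
import torch

buffer = RolloutBuffer(num_steps=128, num_envs=16, obs_dim=8, action_dim=2, device="cuda")
assert buffer.obs.shape == (128, 16, 8)   # [T, N, obs_dim]
assert buffer.rewards.shape == (128, 16)  # [T, N]
assert buffer.dones.dtype == torch.bool   # done flags stored as bool
```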
## Main Methods
- `add(obs, action, reward, done, value, logprob)`: add one step of data.
- `set_extras(extras)`: attach algorithm-related tensors (e.g., advantages, returns).
- `iterate_minibatches(batch_size)`: randomly sample minibatches; yields dicts containing all fields and extras.

Shuffling and indexing run on the GPU, supporting efficient large-scale training.
## Usage Example
```python
buffer = RolloutBuffer(num_steps, num_envs, obs_dim, action_dim, device)
for t in range(num_steps):
    buffer.add(obs, action, reward, done, value, logprob)
buffer.set_extras({"advantages": adv, "returns": ret})
for batch in buffer.iterate_minibatches(batch_size):
    # batch["obs"], batch["actions"], batch["advantages"] ...
    pass
```
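The `adv` and `ret` tensors above are typically produced by an advantage estimator such as GAE, which the extras mechanism is designed for. A minimal sketch over the buffer's `[T, N]` tensors (`gamma`, `lam`, the bootstrap `last_value`, and the done semantics are assumptions of this sketch, not part of the module):

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Walk backwards through time, accumulating the GAE recursion per env
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    gae = torch.zeros_like(last_value)
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = (~dones[t]).float()  # no bootstrap past a terminal step
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values
    return advantages, returns

adv, ret = compute_gae(buffer.rewards, buffer.values, buffer.dones, last_value)
```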
## Design and Extension
- Supports multi-environment parallel collection; compatible with Gymnasium/IsaacGym environments.
- All data is allocated on the GPU to avoid frequent CPU-GPU copies.
- The extras dict can be extended flexibly to meet different algorithms' needs (e.g., GAE, TD-lambda, distributional advantages), as illustrated below.
- The minibatch iterator shuffles automatically, which improves training stability.
- Designed for on-policy algorithms (PPO, A2C, etc.), with support for custom fields and sampling logic; off-policy methods such as SAC typically use a replay buffer instead.
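Because extras are treated uniformly, attaching a custom field requires no changes to the buffer itself; any tensor with a leading `[T, N]` shape is sampled alongside the built-in fields. A small illustration (the `entropies` field and the 0.01 coefficient are hypothetical):

```python
# Hypothetical extra field: per-step policy entropy, shape [T, N]
buffer.set_extras({"advantages": adv, "returns": ret, "entropies": ent})
for batch in buffer.iterate_minibatches(batch_size):
    entropy_bonus = 0.01 * batch["entropies"].mean()  # illustrative usage
```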
## Code Example
A minimal sketch of the class, filled in from the field and method descriptions above (the write-cursor bookkeeping is an assumption of this sketch):

```python
import torch

class RolloutBuffer:
    def __init__(self, num_steps, num_envs, obs_dim, action_dim, device):
        # Preallocate all storage on the target device, shape [T, N, ...]
        self.obs = torch.zeros(num_steps, num_envs, obs_dim, device=device)
        self.actions = torch.zeros(num_steps, num_envs, action_dim, device=device)
        self.rewards = torch.zeros(num_steps, num_envs, device=device)
        self.dones = torch.zeros(num_steps, num_envs, dtype=torch.bool, device=device)
        self.values = torch.zeros_like(self.rewards)
        self.logprobs = torch.zeros_like(self.rewards)
        self._extras, self.step = {}, 0

    def add(self, obs, action, reward, done, value, logprob):
        # Write one step of data at the current cursor, then advance it
        t, self.step = self.step, self.step + 1
        self.obs[t], self.actions[t], self.rewards[t] = obs, action, reward
        self.dones[t], self.values[t], self.logprobs[t] = done, value, logprob

    def set_extras(self, extras):
        # Attach algorithm-related tensors (e.g., advantages, returns)
        self._extras = extras

    def iterate_minibatches(self, batch_size):
        # Random minibatch sampling; one possible implementation is sketched below
        ...
```
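One way to fill in `iterate_minibatches`, consistent with the layout above (a sketch, not the module's actual implementation): flatten the `[T, N]` axes into a single batch axis, draw a random permutation on the buffer's device, and yield minibatch dicts until the data is exhausted.

```python
def iterate_minibatches(self, batch_size):
    # Flatten [T, N, ...] -> [T*N, ...] so each row is one transition
    fields = {"obs": self.obs, "actions": self.actions, "rewards": self.rewards,
              "dones": self.dones, "values": self.values, "logprobs": self.logprobs,
              **self._extras}
    flat = {k: v.reshape(v.shape[0] * v.shape[1], *v.shape[2:]) for k, v in fields.items()}
    # Shuffle on-device so no data leaves the GPU
    perm = torch.randperm(self.obs.shape[0] * self.obs.shape[1], device=self.obs.device)
    for start in range(0, perm.numel(), batch_size):
        idx = perm[start:start + batch_size]
        yield {k: v[idx] for k, v in flat.items()}
```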
## Practical Tips
- Call `set_extras` after each rollout so that advantage/return tensors stay aligned with the main data (a cheap check is sketched below).
- When using `iterate_minibatches`, choose a `batch_size` that divides `num_steps * num_envs` evenly; uneven final minibatches can add gradient noise.
- Extend the extras dict as needed for custom sampling and statistics.
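A cheap way to enforce the alignment mentioned above, assuming the `[T, N]` layout (`adv`/`ret` are the tensors from the usage example):

```python
# Shape check before attaching extras; all per-step extras should be [T, N]
assert adv.shape == buffer.rewards.shape == ret.shape
buffer.set_extras({"advantages": adv, "returns": ret})
```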