# EmbodiAgent
EmbodiAgent is a hierarchical multi-agent system that enables robots to perform complex manipulation tasks through closed-loop planning, code generation, and validation. The system combines vision-language models (VLMs) and large language models (LLMs) to translate high-level goals into executable robot actions.
## Quick Start
### Prerequisites
Ensure you have access to Azure OpenAI or a compatible LLM endpoint.
```bash
# Set environment variables
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-api-key"
```
### Using Different LLM/VLM APIs
The system uses LangChain's `AzureChatOpenAI` by default. To use a different LLM/VLM provider, you can modify the `create_llm` function in `embodichain/agents/hierarchy/llm.py`.
#### Azure OpenAI
```bash
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-api-key"
export OPENAI_API_VERSION="2024-10-21"  # Optional, defaults to "2024-10-21"
```
#### OpenAI
To use OpenAI directly instead of Azure, modify `llm.py`:
```python
import os

from langchain_openai import ChatOpenAI

def create_llm(*, temperature=0.0, model="gpt-4o"):
    return ChatOpenAI(
        temperature=temperature,
        model=model,
        api_key=os.getenv("OPENAI_API_KEY"),
    )
```
Then set:
```bash
export OPENAI_API_KEY="your-api-key"
```
#### Other Providers
You can use other LangChain-compatible providers by modifying the `create_llm` function, for example:
**Anthropic Claude:**
```python
import os

from langchain_anthropic import ChatAnthropic

def create_llm(*, temperature=0.0, model="claude-3-opus-20240229"):
    return ChatAnthropic(
        temperature=temperature,
        model=model,
        anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
    )
```
**Google Gemini:**
```python
import os

from langchain_google_genai import ChatGoogleGenerativeAI

def create_llm(*, temperature=0.0, model="gemini-pro"):
    return ChatGoogleGenerativeAI(
        temperature=temperature,
        model=model,
        google_api_key=os.getenv("GOOGLE_API_KEY"),
    )
```
### Run the System
Run the agent system with the following command:
```bash
python embodichain/lab/scripts/run_agent.py \
    --task_name YourTask \
    --gym_config configs/gym/your_task/gym_config.json \
    --agent_config configs/gym/agent/your_agent/agent_config.json \
    --regenerate False
```
Parameters:

- `--task_name`: Name identifier for the task
- `--gym_config`: Path to the gym environment configuration file
- `--agent_config`: Path to the agent configuration file (defines prompts and agent behavior)
- `--regenerate`: If `True`, forces regeneration of plans/code even if cached
## System Architecture
The system operates on a closed-loop control cycle:
1. **Observe**: The `TaskAgent` perceives the environment via multi-view camera inputs.
2. **Plan**: It decomposes the goal into natural language steps.
3. **Code**: The `CodeAgent` translates steps into executable Python code using atomic actions.
4. **Execute**: The code runs in the environment; runtime errors are caught immediately.
5. **Validate**: The `ValidationAgent` analyzes the result images, selects the best camera angle, and judges success.
6. **Refine**: If validation fails, feedback is sent back to the agents to regenerate the plan or code.
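As a rough sketch, the cycle can be pictured as the loop below. All class and method names here are illustrative stand-ins, not the actual EmbodiAgent API:

```python
# Illustrative closed-loop driver; every class/method name is a hypothetical
# stand-in for the real TaskAgent/CodeAgent/ValidationAgent interfaces.
MAX_ATTEMPTS = 3

def run_closed_loop(env, task_agent, code_agent, validation_agent, goal):
    feedback = None
    for attempt in range(MAX_ATTEMPTS):
        images = env.render_all_cameras()                  # Observe
        plan = task_agent.plan(goal, images, feedback)     # Plan
        code = code_agent.generate(plan, feedback)         # Code
        try:
            env.execute(code)                              # Execute
        except RuntimeError as err:
            feedback = f"Runtime error: {err}"             # Refine (code-level)
            continue
        result = validation_agent.validate(plan, env.render_all_cameras())
        if result.success:                                 # Validate
            return True
        feedback = result.explanation                      # Refine (plan-level)
    return False
```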
## Core Components

### TaskAgent

Located in: `embodichain/agents/hierarchy/task_agent.py`
Responsible for high-level reasoning. It parses visual observations and outputs a structured plan.
For every step, it generates a specific condition (e.g., “The cup must be held by the gripper”), which is used later by the `ValidationAgent`.
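For illustration, a generated plan for a pouring task might be structured as follows; the exact schema and field names depend on the prompt templates, so they are assumptions here:

```python
# Hypothetical TaskAgent output; the real schema is set by the prompts.
plan = [
    {"step": "Move the gripper above the red cup.",
     "condition": "The gripper is directly above the red cup."},
    {"step": "Grasp the red cup.",
     "condition": "The cup must be held by the gripper."},
    {"step": "Pour from the red cup into the blue cup.",
     "condition": "Water is visible in the blue cup."},
]
```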
Prompt Strategies:

- `one_stage_prompt`: Direct VLM-to-Plan generation.
- `two_stage_prompt`: Separates visual analysis from planning logic.
### CodeAgent

Located in: `embodichain/agents/hierarchy/code_agent.py`
Translates natural language plans into executable Python code using atomic actions from the action bank.
- Generates Python code that follows strict coding guidelines (no loops, only provided APIs)
- Executes code in a sandboxed environment with immediate error detection
- Uses Abstract Syntax Tree (AST) parsing to ensure code safety and correctness (see the sketch below)
- Supports few-shot learning through code examples in the configuration
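The actual checks are implemented in `code_agent.py`; as a minimal sketch of the idea, an AST pass that rejects loops and calls outside the provided API could look like this (the allow-list contents and error messages are assumptions):

```python
import ast

# Hypothetical allow-list of atomic actions; the real one comes from the action bank.
ALLOWED_CALLS = {"move_to", "grasp", "release"}

def check_code_safety(source: str) -> None:
    """Reject generated code that uses loops or calls functions outside the API."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Enforce the "no loops" coding guideline.
        if isinstance(node, (ast.For, ast.While)):
            raise ValueError("Loops are not allowed in generated code.")
        # Enforce the "only provided APIs" guideline for plain function calls.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in ALLOWED_CALLS:
                raise ValueError(f"Call to unknown function: {node.func.id}")
```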
### ValidationAgent

Located in: `embodichain/agents/hierarchy/validation_agent.py`
Closes the loop by verifying if the robot actually achieved what it planned.
- Uses a specialized LLM call (`select_best_view_dir`) to analyze images from all cameras and pick the single best angle that proves the action’s outcome, ignoring irrelevant views.
- If an error occurs (runtime or logic), it generates a detailed explanation which is fed back to the `TaskAgent` or `CodeAgent` for the next attempt.
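Conceptually, validating a single step proceeds as sketched below. Only `select_best_view_dir` is named in the source; the surrounding calls and return conventions are hypothetical:

```python
# Conceptual per-step validation; judge() and the return convention are
# hypothetical, only select_best_view_dir appears in validation_agent.py.
def validate_step(validation_agent, step, camera_images):
    # Pick the single camera angle that best shows the action's outcome.
    best_view = validation_agent.select_best_view_dir(camera_images)
    verdict = validation_agent.judge(step["condition"], camera_images[best_view])
    # On failure, return an explanation to feed back to the TaskAgent/CodeAgent.
    return None if verdict.success else verdict.explanation
```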
## Configuration Guide
The Agent configuration block controls the context provided to the LLMs. Prompt files are resolved in the following order:
1. **Config directory**: Task-specific prompt files in the same directory as the agent configuration file (e.g., `configs/gym/agent/pour_water_agent/`)
2. **Default prompts directory**: Reusable prompt templates in `embodichain/agents/prompts/`
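A minimal sketch of this lookup order in Python (the function name and signature are hypothetical):

```python
import os

def resolve_prompt(filename, config_dir,
                   default_dir="embodichain/agents/prompts"):
    """Prefer a task-specific prompt next to the agent config, else fall
    back to the reusable template in the default prompts directory."""
    task_specific = os.path.join(config_dir, filename)
    if os.path.exists(task_specific):
        return task_specific
    return os.path.join(default_dir, filename)
```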
The main parameters in this block are:

| Parameter | Description | Typical Use |
|---|---|---|
|  | Task-specific goal description | “Pour water from the red cup to the blue cup.” |
|  | Physical rules & constraints | World coordinate system definitions, safety rules. |
|  | API Documentation | List of available functions (e.g., …) |
|  | Coding guidelines | “Use provided APIs only. Do not use loops.” |
|  | Few-shot examples | Previous successful code snippets to guide style. |
## File Structure
```text
embodichain/agents/
├── hierarchy/
│   ├── agent_base.py        # Abstract base handling prompts & images
│   ├── task_agent.py        # Plan generation logic
│   ├── code_agent.py        # Code generation & AST execution engine
│   ├── validation_agent.py  # Visual analysis & view selection
│   └── llm.py               # LLM configuration and instances
├── mllm/
│   └── prompt/              # Prompt templates (LangChain)
└── prompts/                 # Agent prompt templates
```