Terminus-2
Harbor's high-performance reference agent implementation
Overview
Terminus-2 is Harbor's reference agent implementation, a research-preview agent for evaluating language models' capabilities in terminal environments. It operates fully autonomously within sandboxed environments and serves as a high-performance, neutral test bed for understanding language-model agent capabilities.
Key Features
Mono-tool Design
Terminus-2 uses a unique single-tool approach - an interactive tmux session - allowing it to:
- Send keystrokes and navigate environments flexibly
- Scroll through output and use arrow keys to navigate menus
- Launch additional shells within the environment
- Interact with any terminal-based application naturally
This design philosophy enables the agent to work with virtually any command-line interface without requiring specialized tools for each interaction pattern.
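The mono-tool interface can be pictured as a thin wrapper around tmux subcommands. The helper names below (`send_keys_cmd`, `capture_pane_cmd`) are hypothetical, not Terminus-2's actual API; they only illustrate how a single tmux session supports typing commands, pressing special keys, and reading screen output:

```python
def send_keys_cmd(session: str, keys: str, press_enter: bool = True) -> list[str]:
    """Build a tmux invocation that types `keys` into the target session.

    Hypothetical helper for illustration; Terminus-2's real interface differs.
    """
    cmd = ["tmux", "send-keys", "-t", session, keys]
    if press_enter:
        cmd.append("Enter")
    return cmd

def capture_pane_cmd(session: str) -> list[str]:
    """Build a tmux invocation that prints the visible pane contents as text."""
    return ["tmux", "capture-pane", "-p", "-t", session]

# The same mechanism covers ordinary commands and special keys alike:
run_ls = send_keys_cmd("agent", "ls -la")                   # type a shell command
scroll = send_keys_cmd("agent", "Down", press_enter=False)  # press an arrow key
```

Because everything is keystrokes into one pane, the same primitive drives shells, TUI menus, pagers, and editors without per-application tooling.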
Independent Execution
The agent's logic runs in a separate Python process from the Docker container, enabling:
- Remote connection to arbitrary computer environments
- Dockerized execution environments for safety and isolation
- Flexible deployment across different infrastructure setups
- Clean separation between agent logic and task environment
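Because the agent process is separate from the container, each action crosses a process boundary. A minimal sketch, assuming a plain `docker exec` transport (the real deployment may use any remote-execution mechanism, and `exec_cmd`/`run_in_container` are illustrative names, not Harbor APIs):

```python
import subprocess

def exec_cmd(container: str, command: str) -> list[str]:
    """Build the `docker exec` invocation (hypothetical helper for illustration)."""
    return ["docker", "exec", container, "sh", "-c", command]

def run_in_container(container: str, command: str) -> str:
    """Run a command inside the container from the external agent process.
    Sketch only; assumes the `docker` CLI is on PATH."""
    result = subprocess.run(exec_cmd(container, command),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

Swapping the transport (SSH, a cloud VM API, Kubernetes `exec`) changes only this boundary layer, not the agent logic.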
Autonomy-First Approach
Terminus-2 is designed to operate without human intervention:
- Never asks for user input during task execution
- Independently attempts to complete tasks end-to-end
- Currently recommended only for sandboxed environments due to full autonomy
- Makes decisions and recovers from errors without guidance
Using Terminus-2 with Harbor
Basic Usage
Run Terminus-2 on a task using the --agent terminus-2 flag:
harbor run \
--agent terminus-2 \
--model openai/gpt-5 \
--path examples/tasks/ \
--task-name hello-world
Configuration Options
Terminus-2 supports various configuration options through the agent config:
from harbor.models.trial.config import AgentConfig
from harbor.models.agent_name import AgentName
agent_config = AgentConfig(
name=AgentName.TERMINUS_2,
model_name="openai/gpt-5",
kwargs={
# Parser configuration
"parser_name": "json", # "json" or "xml" (default: "json")
# API configuration
"api_base": "https://your-vllm-server.com", # Custom API endpoint
"temperature": 0.7, # Sampling temperature (default: 0.7)
# Episode/turn limits
"max_turns": 100, # Maximum number of turns/episodes (default: 1000000)
# Summarization configuration
"enable_summarize": True, # Enable context summarization (default: True)
"proactive_summarization_threshold": 8000, # Free tokens threshold for summarization (default: 8000)
# RL training configuration (default: False)
# If enabled, token ids and logprobs are collected in result and persisted in trajectories
"collect_rollout_details": False,
# Advanced model configuration
"reasoning_effort": "medium", # "none", "minimal", "low", "medium", "high", or "default" (default: None)
"max_thinking_tokens": 2048, # For Anthropic extended thinking mode (minimum: 1024, default: None)
# Optional: Register custom model info with LiteLLM
# LiteLLM doesn't recognize uncommon or custom models. For metrics
# tracking and context summarization to work properly, provide model_info following
# https://docs.litellm.ai/docs/completion/token_usage#9-register_model
"model_info": {
"max_input_tokens": 128000,
"max_output_tokens": 4096,
"input_cost_per_token": 0.000003,
"output_cost_per_token": 0.000015,
},
# Session tracking (included in the LLM request body when the provider supports it)
"session_id": "custom-session-id", # Custom session ID (default: auto-generated UUID)
}
)
Conversation History Management
Terminus-2 implements intelligent conversation history management to handle long-running tasks efficiently while staying within context window limits.
Standard Summarization Process
Both proactive and passive summarization use a 3-step subagent process to generate high-quality summaries:
┌─────────────────────────────────────────────────────────────────┐
│ Standard Summarization Flow │
└─────────────────────────────────────────────────────────────────┘
Previous History
│
▼
┌─────────────────────┐
│ 1. Summary Subagent │
│ Input: Previous │
│ Output: Summary │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 2. Question Subagent│
│ Input: Summary │
│ Output: Questions │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 3. Answer Subagent │
│ Input: Previous + │
│ Summary + Qs │
│ Output: Answers │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Main Agent │
│ Context: │
│ • System prompt │
│ • Task │
│ • Summary │
│ • Questions │
│ • Answers │
└─────────────────────┘
Step 1 - Summary Subagent: Receives the full previous conversation history and generates an initial summary.
Step 2 - Question Subagent: Receives only the summary (not the full history) and generates clarifying questions about any missing critical information.
Step 3 - Answer Subagent: Receives the previous history, summary, and questions, then answers the questions to fill in the gaps.
The main agent then continues with a compressed context containing: system prompt, task description, summary, questions, and answers.
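The three subagent calls can be sketched as a small pipeline. Here `llm` stands for any text-in/text-out completion function, and the prompt strings are illustrative placeholders, not Terminus-2's actual prompts:

```python
from typing import Callable, Tuple

def summarize_history(llm: Callable[[str], str], history: str) -> Tuple[str, str, str]:
    """Sketch of the 3-step summarization pipeline (assumed prompts)."""
    # Step 1: summary subagent sees the full previous history.
    summary = llm(f"Summarize this conversation:\n{history}")
    # Step 2: question subagent sees ONLY the summary, probing for gaps.
    questions = llm(f"What critical details are missing from this summary?\n{summary}")
    # Step 3: answer subagent sees history + summary + questions, and fills the gaps.
    answers = llm(
        f"History:\n{history}\nSummary:\n{summary}\n"
        f"Answer these questions from the history:\n{questions}"
    )
    # The main agent resumes with system prompt + task + summary + questions + answers.
    return summary, questions, answers
```

Hiding the full history from the question subagent is the key design choice: it can only ask about what the summary fails to convey, which surfaces exactly the information the compression lost.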
Proactive Summarization
When free tokens (max input tokens - current context length) drop below the proactive_summarization_threshold (default: 8000), Terminus-2:
- Pauses execution
- Runs the standard 3-step summarization process on the conversation history
- Replaces the middle portion of the conversation history with the summary + Q&A
- Keeps the system prompt and task description intact
- Resumes execution with the compressed history
The threshold can be configured via proactive_summarization_threshold in agent config.
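The trigger condition is a simple budget check on the numbers described above; a minimal sketch using the documented default threshold:

```python
def needs_proactive_summarization(
    max_input_tokens: int,
    current_context_tokens: int,
    threshold: int = 8000,  # proactive_summarization_threshold default
) -> bool:
    """Return True when free context space has dropped below the threshold."""
    free_tokens = max_input_tokens - current_context_tokens
    return free_tokens < threshold

# With a 128k-token model, summarization triggers once the context
# grows past 120,000 tokens:
assert needs_proactive_summarization(128_000, 121_000) is True
assert needs_proactive_summarization(128_000, 100_000) is False
```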
Passive Summarization
When a ContextLengthExceededError occurs, Terminus-2 uses a staged fallback strategy to recover and continue execution:
┌─────────────────────────────────────────────────────────────────┐
│ Passive Summarization Fallback Flow │
└─────────────────────────────────────────────────────────────────┘
ContextLengthExceededError
│
▼
┌──────────────────────────────┐
│ 1. Unwind to Free Tokens │
│ Remove recent messages │
│ from end until enough │
│ space (keeps first msg) │
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ 2. Standard Summarization │
│ (3-step subagent process) │
└──────────────────────────────┘
│
┌────────┴────────┐
│ │
Success Failure
│ │
│ ▼
│ ┌──────────────────────────┐
│ │ 3. Fallback Summary │
│ │ Only: System prompt + │
│ │ Task + Current state │
│ └──────────────────────────┘
│ │
│ ┌────────┴────────┐
│ │ │
│ Success Failure
│ │ │
│ │ ▼
│ │ ┌──────────────────────┐
│ │ │ 4. Ultimate Fallback │
│ │ │ System prompt + │
│ │ │ Task + State only │
│ │ │ (Continue without │
│ │ │ summarization) │
│ │ └──────────────────────┘
│ │ │
└────────┴─────────────────┘
│
▼
Continue execution with
compressed/recovered context
Step 1 - Unwind: Remove recent messages from the end of the conversation (in pairs of user + assistant) until there are enough free tokens for summarization. Always keeps at least the first message.
Step 2 - Standard Summarization: Run the 3-step subagent process. If successful, replace the unwound messages with the summary + Q&A and continue execution.
Step 3 - Fallback: If standard summarization fails, attempt a simpler summary using only system prompt, task description, and current state. If successful, continue with this compressed context.
Step 4 - Ultimate Fallback: If fallback also fails, continue execution with only system prompt, task description, and current state (no summary).
This recovery mechanism allows Terminus-2 to continue executing even when context limits are exceeded. Enable with enable_summarize=True in agent config.
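The unwind step (Step 1 above) can be sketched as follows. `count_tokens` stands for whatever tokenizer the deployment uses; the pairwise removal and the keep-the-first-message guarantee match the behavior described above:

```python
from typing import Callable, List

def unwind(messages: List[str], count_tokens: Callable[[List[str]], int],
           budget: int) -> List[str]:
    """Drop trailing user/assistant pairs until the history fits the budget.
    Always keeps at least the first message."""
    kept = list(messages)
    while len(kept) > 1 and count_tokens(kept) > budget:
        # Remove up to two messages (one user + one assistant turn) from the end.
        kept = kept[: max(1, len(kept) - 2)]
    return kept
```

The summarization subagents then run against the freed space; only if they fail does execution fall through to the simpler recovery tiers.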
Reinforcement Learning Support
Terminus-2 is designed with RL training in mind and collects detailed rollout information for use in RL pipelines.
Rollout Details Collection
During execution, Terminus-2 can collect and export:
Token Information
- Prompt Token IDs: List of token ID sequences, one per turn. Each sequence contains the full prompt including chat history.
- Completion Token IDs: List of token ID sequences, one per turn. Each sequence contains the response tokens for that turn.
- Logprobs: List of log probability sequences corresponding to each completion.
These are stored as a list of RolloutDetail objects in the agent result metadata:
# First RolloutDetail contains main agent conversation
rollout_detail = trial_result.agent_result.metadata["rollout_details"][0]
# Access turn-by-turn data
prompt_token_ids = rollout_detail["prompt_token_ids"] # List[List[int]]
completion_token_ids = rollout_detail["completion_token_ids"] # List[List[int]]
logprobs = rollout_detail["logprobs"] # List[List[float]]
Rewards
Terminus-2 integrates with Harbor's verifier system to collect rewards:
# Access rewards from trial results
reward = trial_result.verifier_result.rewards.get("reward", 0)
Trajectory Format
Terminus-2 automatically generates trajectories in the Agent Trajectory Interchange Format (ATIF), Harbor's standardized trajectory format. This enables:
- SFT dataset generation: Convert successful trajectories to supervised fine-tuning data
- RL training: Use complete action sequences and rewards for policy optimization
- Debugging: Inspect detailed step-by-step execution logs
- Visualization: Replay agent actions in Harbor's trajectory viewer
See the Agent Trajectory Format documentation for details on the ATIF specification.
Related Documentation
- Agents Overview - General agent integration guide
- Agent Trajectory Format - ATIF specification and usage
- RL Training - Using Terminus-2 for reinforcement learning
- SFT Datasets - Generating supervised fine-tuning data