Adapters
How to create a new adapter for a new benchmark using Harbor.
Harbor supports running various benchmarks and datasets via a simple, unified interface. SWE-Bench, LiveCodeBench, and other benchmarks are already integrated into Harbor, and our team is actively working to adapt more to the framework. To add a new benchmark or dataset, you need to create an adapter that translates the original benchmark's tasks into the Harbor format.
We welcome the open source community to contribute adapters for new benchmarks and datasets. If you have a benchmark or a dataset of tasks that you want to adapt (e.g., using Harbor's evaluation harness), please follow the steps below to develop your adapter and get it merged.
We'd love to help you adapt your benchmark!
If you are thinking about adapting your benchmark or contributing one from our Adapter List, please join our Discord and reach out to Lin Shi in the #adapters-announcements channel.
Terminal-Bench only: Translating Terminal-Bench Adapters to Harbor
See this section to learn about the requirements and differences between Terminal-Bench and Harbor.
Overview
Adapting a benchmark to Harbor is a straightforward process designed to ensure consistency and quality. This guide will walk you through everything you need to know. However, since each benchmark is unique, the exact process and special requirements may vary slightly depending on the benchmark. Please contact our team to understand the specific requirements and considerations for your benchmark. We will cover the API costs for running parity experiments :-)
Here's a quick look at the typical steps:
- Fork and Prepare the Original Benchmark: First, you'll get familiar with the original benchmark, fork its repository, and run experiments to establish baseline performance metrics.
- Create the Harbor Adapter: Next, you'll write Python adapter code that translates the original benchmark's tasks into the Harbor format. This is the core of the process and involves several substeps:
- Fork the Harbor repository: Get your own copy of the codebase and work on a new branch.
- Develop the adapter code: Convert tasks into the Harbor format.
- Running Harbor Harness: Learn the different ways to run the Harbor harness on your adapter and verify oracle solutions.
- Run Parity Experiments: Verify your adapter's performance against the original benchmark baseline results.
- Record Parity Results: Formally document the performance comparison.
- Register the Dataset: Add your new tasks to the official dataset repository and registry.
- Document and Submit: Finally, you'll document your adapter's usage and parity results in a README.md and submit your work through a pull request.
We'll break down each step in detail below. Let's get started!
The Adapter Development Workflow
Creating a high-quality adapter involves several key steps. Following this workflow ensures that the adapted benchmark is a faithful and reliable implementation of the original.
1. Fork and Prepare the Original Benchmark
Before writing any adapter code, it's crucial to deeply understand the original benchmark. Then you will need to fork the original benchmark's repository and write scripts to allow Harbor-comparable experiments. Specifically, you will need to:
- Fork the original benchmark's repository: Create a new branch for your adaptation work (e.g., harbor-adapter).
- Implement agents and models: The ultimate goal for this step is to enable fair comparisons between the original benchmark's harness and Harbor. Therefore, you may either:
  - Implement CLI agents and models that are supported by Harbor (Codex, Claude-Code, Gemini-CLI, etc.) in the original benchmark if none exist, or
  - Skip this step and instead implement a custom agent in Harbor, following the agent tutorial.
  There are no restrictions on how many or which agents you choose. However, the comparison results should be sufficient to demonstrate that the Harbor adapter faithfully reproduces the original benchmark's harness.
- Establish a baseline: Run the original benchmark with the agents and models.
- Write scripts and documentation: Write scripts to allow Harbor-comparable experiments, and document the process in a README.md file.
Keep a link to the forked repo, and you are good to go!
2. Create the Harbor Adapter
With a solid baseline, you can now create the adapter itself within the harbor repository. Here are the main steps:
2.0 Read the README template
The Harbor adapter README template serves as the template for the final README file that you will create for your submitted adapter. However, it is more than just a template: it includes essential instructions that explain the requirements and facilitate the development and review processes. Reading it first will give you a sense of what to provide and will guide your code, experiments, and documentation.
2.1 Fork the Harbor repository
Fork the Harbor repository and create a new branch for your adapter (e.g., {adapter-name}-adapter).
git clone https://github.com/{your-github-username}/harbor.git
cd harbor
git checkout -b {your-adapter-name}-adapter
2.2 Develop the adapter code
Develop the adapter under adapters/{adapter-name}. You may refer to the existing adapters in the adapters/ directory and follow their patterns. The adapter's primary job is to parse the original benchmark's data and generate task directories in the standard Harbor format. Your code should prepare task directories locally following the layout of the example below.
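A minimal example (the task ID and the inline comments are illustrative; the layout matches the Harbor structure shown in the Terminal-Bench migration section later in this guide):

```
<task-id>/
├── task.toml            # task metadata plus agent and verifier timeouts
├── instruction.md       # the task instruction given to the agent
├── environment/
│   └── Dockerfile       # environment the agent runs in
├── solution/
│   └── solve.sh         # oracle solution; must reach a 100% reward
└── tests/
    ├── test.sh          # writes the reward to /logs/verifier/reward.txt
    └── test_*.py        # optional pytest files
```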
2.3 Requirements and Tips for the Adapter Code
Your adapter code is used to generate task directories. A typical directory structure for your adapter code is as follows:
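The exact layout is up to you. As a sketch based on the files referenced elsewhere in this guide (run_adapter.py, README.md, and parity_experiment.json are expected; anything else here is an assumption), an adapter directory often looks like:

```
adapters/<adapter-name>/
├── README.md                  # usage and parity documentation (Section 3)
├── run_adapter.py             # entry point that generates task directories (Section 2.7)
├── parity_experiment.json     # recorded parity results (Section 2.6)
└── ...                        # helper modules, templates, or configs your adapter needs
```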
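For intuition, here is a highly simplified, hypothetical sketch of the core conversion step most adapters implement. The record fields and the write_task helper are assumptions for illustration; real adapters in the adapters/ directory handle environments, dependencies, and test generation in much more detail:

```python
"""Hypothetical sketch: convert one source-benchmark record into a Harbor task directory."""
from pathlib import Path

TASK_TOML = """\
version = "1.0"

[metadata]
author_name = "{author}"

[agent]
timeout_sec = 3000.0

[verifier]
timeout_sec = 3000.0
"""


def write_task(record: dict, output_dir: Path) -> None:
    # record is a hypothetical dict parsed from the original benchmark's data
    task_dir = output_dir / record["task_id"]
    (task_dir / "environment").mkdir(parents=True, exist_ok=True)
    (task_dir / "solution").mkdir(exist_ok=True)
    (task_dir / "tests").mkdir(exist_ok=True)

    (task_dir / "instruction.md").write_text(record["instruction"])
    (task_dir / "task.toml").write_text(TASK_TOML.format(author=record["author"]))
    (task_dir / "environment" / "Dockerfile").write_text(record["dockerfile"])
    (task_dir / "solution" / "solve.sh").write_text(record["oracle_solution"])
    # The test script must write a numeric reward to /logs/verifier/reward.txt
    (task_dir / "tests" / "test.sh").write_text(record["test_script"])
```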
Important: You need to ensure that the oracle solution passes with a 100% reward (i.e., all tests pass when running the solution). After preparing task directories locally, you may simply run uv run harbor jobs start -p datasets/<your-adapter-name> from the harbor repo root to run oracle tests and/or debug.
2.4 Running Harbor Harness
There are several ways to run the Harbor harness on your adapter:
Option 1: Using individual trials (for testing single tasks)
# Run oracle agent on a single task
uv run harbor trials start -p datasets/<your-adapter-name>/<task-id>
# Run with specific agent and model
uv run harbor trials start -p datasets/<your-adapter-name>/<task-id> -a <agent-name> -m <model-name>
Option 2: Using jobs with local dataset path
# Run on entire local dataset
uv run harbor jobs start -p datasets/<your-adapter-name> -a <agent-name> -m <model-name>
Option 3: Using jobs with configuration file. Refer to harbor/examples/configs for configuration examples. It's highly recommended to write a reference config file for your adapter to ensure reproducibility.
# Create a job config YAML (see harbor/examples/configs/ for examples)
uv run harbor jobs start -c adapters/<your-adapter-name>/<config>.yaml -a <agent-name> -m <model-name>
Option 4: Using local registry after your dataset registry PR gets merged. This step is required to check the correctness of (1) your registered dataset and (2) your updated registry.json in the Harbor repository. If this run successfully passes all oracle tests, then after your adapter PR gets merged, people can directly use -d without --registry-path to run evaluation (the same way as Option 5).
# Run from registry
uv run harbor jobs start -d <your-adapter-name> --registry-path registry.json -a <agent-name> -m "<model-name>"
Option 5: Using registry dataset (after registration and all PRs merged)
# Run from registry
uv run harbor jobs start -d <your-adapter-name> -a <agent-name> -m "<model-name>"
You should include instructions for running in multiple ways in the README.md for your adapter, following the Harbor adapter README template. Note that the order of these options is organized differently in the final adapter README. This is because, from the user's perspective, Option 5 is the primary way to run the adapter without needing to prepare task directories; the adapter code and other running methods are mainly used for development and reproduction.
2.5 Run Parity Experiments
Run Oracle First!
Before running any parity experiments with real agents, always verify your oracle solutions pass first. This helps catch task configuration issues early.
Please reach out before you run parity!
Please contact Lin Shi through Discord to get API keys and confirm parity experiment plans before running them. We have benchmark-specific recommendations for agents, models, and the number of trials needed to verify equivalence.
You need to run parity experiments to verify the adapter. First, run experiments with your selected agents and models on the original harness from Step 1; then run the Harbor harness (see Section 2.4) with the same agents and models via the adapter. Run both sides multiple times and compare average scores and standard deviations. The average scores across multiple trials should be comparable, demonstrating that the adaptation is equivalent (i.e., running the benchmark with Harbor is equivalent to running it with the original harness).
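The aggregation itself is simple. As a minimal sketch (not part of Harbor; the trial scores below are placeholders), the mean and standard error reported in parity_experiment.json can be computed like this:

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> str:
    """Aggregate per-trial scores into "mean +/- standard error"."""
    m = mean(scores)
    se = stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return f"{m:.2f} +/- {se:.2f}"


# Placeholder scores from, e.g., 5 trials on each side
print("original:", summarize([41.2, 39.8, 40.5, 42.0, 40.1]))
print("harbor:  ", summarize([40.7, 41.5, 39.9, 40.8, 41.1]))
```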
2.6 Record Parity Results
To formally store and track the performance parity between the original benchmark and your adapter, create a parity_experiment.json file in your adapter's directory. A typical file would look like this:
[
{
"adapter_name": <adapter-name>,
"agent": <agent-name>@<agent-version>,
"model": <model-name-with-detailed-version>,
"date": <date>,
"notes": <notes>, // e.g., "n tasks; averaged over 5 trials"
"forked_repo": <forked-repo-link>, // For reproducing the parity experiments on the original benchmark side; usually this is a fork of the original benchmark repo
"adapter_pr": <adapter-pr-link>, // You can add this link after the adapter PR is created
"dataset_pr": <dataset-pr-link>, // Registry PR link
"metrics": [
{
"benchmark_name": <original-benchmark-name>,
"metric": <metric>,
"original": <mean +/- std_error>, // Scores obtained from the original benchmark
"harbor": <mean +/- std_error>, // Scores obtained from Harbor adapter
}, // Multiple metrics are supported and should follow a similar format
]
},
...
]
You should also include the parity experiment results in the README.md of your adapter. For example, you can add the following table:
| Agent | Model | Metric | Original Benchmark Performance | Harbor Adapter Performance |
|-------|-------|--------|--------------------------------|----------------------------|
| claude-code | claude-4.1-opus | Metric | Score ± Std | Score ± Std |
| codex | gpt-5 | Metric | Score ± Std | Score ± Std |
| ... | ... | ... | ... | ... |
Then include the following links:
- The link to the original benchmark's GitHub repository
- The link to the forked repo from Step 1
- The link to the dataset PR from Step 2.7
- [Optional] The link to the adapter PR, which you can add in another commit after the adapter PR is created
2.7 Register the Dataset
Once your adapter correctly generates tasks and you have verified the parity experiments, add the generated task directories to the official Harbor datasets repository.
- Fork and clone the dataset repository:
git clone https://github.com/{your-github-username}/harbor-datasets.git
- Add your tasks: Place the generated task directories under datasets/<your-adapter-name>/. For example, if you follow the adapter development instructions above correctly, you should be able to run the following example commands to add your tasks to the dataset repository:
cd harbor/adapters/<your-adapter-name>
# Specify custom path to the harbor-datasets repo
uv run run_adapter.py --output-dir /path/to/harbor-datasets/datasets/<your-adapter-name>
- Pull Request: Create a pull request to the harbor-datasets repository. It's recommended to link the original benchmark's GitHub repository in your PR. Request @Slimshilin for review, and he will merge it so you can try --registry-path for the harbor harness. You may always submit another PR to update the dataset registry.
Then you should navigate to the harbor repository (not the dataset repo!) and add a new entry to the registry.json file in the root. Note: Harbor's registry format uses task-level entries with Git URLs. For example:
[
// existing entries...
{
"name": "<your-adapter-name>",
"version": "1.0",
"description": "A brief description of the adapter. Original benchmark: [URL]. More details at [adapter README URL].",
"tasks": [
{
"name": "<task-id-1>",
"git_url": "https://github.com/laude-institute/harbor-datasets.git",
"git_commit_id": "<commit-hash>",
"path": "datasets/<your-adapter-name>/<task-id-1>"
},
{
"name": "<task-id-2>",
"git_url": "https://github.com/laude-institute/harbor-datasets.git",
"git_commit_id": "<commit-hash>",
"path": "datasets/<your-adapter-name>/<task-id-2>"
}
// ... more tasks
]
}
]
For initial development, you can use "git_commit_id": "head" to reference the latest commit, but for production you should pin to a specific commit hash for reproducibility (e.g., the output of git rev-parse HEAD in your harbor-datasets checkout).
Important: After your dataset registry PR is merged, run the following command to verify that your registered dataset and registry.json are correctly configured before submitting the final adapter PR.
# Run from registry
uv run harbor jobs start -d <your-adapter-name> --registry-path registry.json
3. Document and Submit
Follow the Harbor adapter README template again to draft documentation for your adapter.
4. Other Useful Resources
- The Harbor documentation provides detailed information about running tasks and jobs with Harbor.
- The Harbor repository contains example tasks and configurations.
- The agent tutorial provides instructions on how to create and use your customized agent in Harbor.
5. Getting Help
Thank you for your interest in Harbor and building an adapter! If you have any questions, please ask in the #adapters-spam channel in our Discord.
Translating Terminal-Bench Adapters to Harbor
If you have an existing Terminal-Bench adapter and want to convert it to Harbor format, this section outlines the key differences and migration steps. Harbor maintains the same core principles as Terminal-Bench but uses a different file structure and configuration format.
Note that the Harbor adapter should be isolated from the Terminal-Bench repo. You are expected to write adapter code following the same process as for Terminal-Bench instead of applying a direct translation script. Fortunately, with a good Terminal-Bench adapter, it is relatively easy to create a Harbor adapter by handling a slightly different task format.
Key Format Differences
The following table summarizes the main differences between Terminal-Bench and Harbor task formats:
| Aspect | Terminal-Bench | Harbor |
|---|---|---|
| Task Configuration | task.yaml (YAML format) | task.toml (TOML format) |
| Instruction | Embedded in task.yaml as instruction field | Separate instruction.md file |
| Dockerfile Location | Root level: Dockerfile | Subdirectory: environment/Dockerfile |
| Solution Script | Root level: solution.sh | Subdirectory: solution/solve.sh |
| Test Scripts | Root level: run-tests.sh + tests/test_outputs.py | Subdirectory: tests/test.sh |
| Test Verification | Exit code based (pytest) | Reward-based: write to /logs/verifier/reward.txt |
| Docker Compose | docker-compose.yaml in task root | Not typically used per-task |
| Default Output Directory | tasks/<adapter-name> | datasets/<adapter-name> |
| Registry Format | Dataset-level with dataset_path | Task-level with git_url and path per task |
| CLI Commands | tb run --dataset / tb run --dataset-path | harbor jobs start -d / harbor trials start -p |
| Metrics | Resolved rate (binary pass/fail per task) | Rewards that support multiple metrics and float-type values |
IMPORTANT: If the Terminal-Bench adapter used a tweaked metric (e.g., threshold-based scoring, ignoring certain metrics), then you'll need to re-implement the adapter for Harbor to support the original metrics used by the benchmark, as Harbor now supports multiple metrics as rewards.
File Structure Migration
Terminal-Bench structure:
task-id/
├── task.yaml
├── Dockerfile
├── docker-compose.yaml
├── run-tests.sh
├── solution.sh
└── tests/
└── test_outputs.py
Harbor structure:
task-id/
├── task.toml
├── instruction.md
├── environment/
│ └── Dockerfile
├── solution/
│ └── solve.sh
└── tests/
├── test.sh
└── test_*.py (optional)Migration Steps
Step 1: Update Task Configuration Format
Convert task.yaml to task.toml and extract the instruction:
Before (task.yaml):
instruction: |
Your task instruction here...
Multiple lines...
author_email: example@email.com
author_name: Author Name
difficulty: hard
category: programming
tags:
- debugging
- python
parser_name: swebench
max_agent_timeout_sec: 3000.0
max_test_timeout_sec: 3000.0
After (task.toml):
version = "1.0"
[metadata]
author_email = "example@email.com"
author_name = "Author Name"
difficulty = "hard"
category = "programming"
tags = ["debugging", "python"]
[agent]
timeout_sec = 3000.0
[verifier]
timeout_sec = 3000.0And create instruction.md:
Your task instruction here...
Multiple lines...
Step 2: Reorganize Files into Subdirectories
- Move Dockerfile → environment/Dockerfile
- Move solution.sh → solution/solve.sh
- Move run-tests.sh → tests/test.sh
- Remove docker-compose.yaml (usually not needed per-task in Harbor)
Step 3: Update Test Scripts for Reward-Based System
Before (run-tests.sh in Terminal-Bench):
#!/bin/bash
# Run tests and create marker file
pytest tests/ > test_results.txt
if [ $? -eq 0 ]; then
echo "PASSED" > /tmp/test_marker.txt
else
echo "FAILED" > /tmp/test_marker.txt
fi
After (tests/test.sh in Harbor):
#!/bin/bash
# Install dependencies if needed
apt-get update && apt-get install -y python3-pip
pip3 install pytest
# Run tests
pytest /tests/test_*.py
# Write reward based on test results
if [ $? -eq 0 ]; then
echo 1 > /logs/verifier/reward.txt
else
echo 0 > /logs/verifier/reward.txt
fi
Key changes:
- Harbor mounts /logs/verifier for test outputs
- Write a numeric reward (can be a float) to /logs/verifier/reward.txt
- You can still use pytest, but the final output must be the reward file (a fractional-reward sketch follows below)
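For example, to emit a fractional reward instead of a binary 0/1, tests/test.sh could call a small Python helper along the lines of this hypothetical sketch (parsing pytest's JUnit XML report and the /tmp report path are illustrative choices, not a Harbor requirement):

```python
"""Hypothetical helper invoked from tests/test.sh to write a fractional reward."""
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

# Run the test suite and produce a machine-readable report.
subprocess.run(["pytest", "/tests", "--junitxml=/tmp/report.xml"], check=False)

root = ET.parse("/tmp/report.xml").getroot()
suite = root if root.tag == "testsuite" else root.find("testsuite")
total = int(suite.get("tests", 0))
failed = int(suite.get("failures", 0)) + int(suite.get("errors", 0))

# Harbor reads the reward from this file; a float between 0 and 1 is allowed.
reward = (total - failed) / total if total else 0.0
Path("/logs/verifier/reward.txt").write_text(f"{reward}\n")
```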
Step 4: Update Adapter Code
- Change the default output directory from tasks/ to datasets/
- Update the template directory to match the Harbor structure
- Modify the file generation logic to create subdirectories (environment/, solution/, tests/)
- Split instruction extraction into a separate instruction.md file
- Convert YAML generation to TOML generation (see the sketch below)
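As a rough, hypothetical sketch of the YAML-to-TOML conversion and the instruction split inside your adapter code (assuming PyYAML is available and using the field names from the example task.yaml above):

```python
"""Hypothetical sketch: convert a Terminal-Bench task.yaml into task.toml + instruction.md."""
from pathlib import Path

import yaml  # PyYAML

TASK_TOML = """\
version = "1.0"

[metadata]
author_email = "{author_email}"
author_name = "{author_name}"
difficulty = "{difficulty}"
category = "{category}"
tags = [{tags}]

[agent]
timeout_sec = {agent_timeout}

[verifier]
timeout_sec = {test_timeout}
"""


def convert(tb_task_dir: Path, harbor_task_dir: Path) -> None:
    cfg = yaml.safe_load((tb_task_dir / "task.yaml").read_text())
    harbor_task_dir.mkdir(parents=True, exist_ok=True)

    # The instruction moves out of the config and into its own file.
    (harbor_task_dir / "instruction.md").write_text(cfg["instruction"])

    tags = ", ".join(f'"{t}"' for t in cfg.get("tags", []))
    (harbor_task_dir / "task.toml").write_text(
        TASK_TOML.format(
            author_email=cfg.get("author_email", ""),
            author_name=cfg.get("author_name", ""),
            difficulty=cfg.get("difficulty", ""),
            category=cfg.get("category", ""),
            tags=tags,
            agent_timeout=cfg.get("max_agent_timeout_sec", 3000.0),
            test_timeout=cfg.get("max_test_timeout_sec", 3000.0),
        )
    )
```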
Step 5: Update Registry Format
Terminal-Bench registry.json:
{
"name": "my-adapter",
"version": "head",
"description": "...",
"github_url": "https://github.com/laude-institute/terminal-bench-datasets.git",
"dataset_path": "datasets/my-adapter",
"task_id_subset": null
}
Harbor registry.json:
{
"name": "my-adapter",
"version": "1.0",
"description": "...",
"tasks": [
{
"name": "task-1",
"git_url": "https://github.com/laude-institute/harbor-datasets.git",
"git_commit_id": "abc123",
"path": "datasets/my-adapter/task-1"
}
// ... one entry per task
]
}
Getting Help
If you have any questions about translating your Terminal-Bench adapter to Harbor, please ask in the #adapters-spam channel in our Discord or reach out to Lin Shi.