Harbor

Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models. It uses LiteLLM to call 100+ LLM providers.

# Install
pip install harbor

# Run a benchmark with any LiteLLM-supported model
harbor run --dataset terminal-bench@2.0 \
   --agent claude-code \
   --model anthropic/claude-opus-4-1 \
   --n-concurrent 4
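
As a rough sketch of how the command above might be reused to compare several models, the script below wraps the same `harbor run` invocation in a small function and loops over model names. It uses only the flags shown above. The `run_benchmark` helper, the `DRY_RUN` variable, and the second model name are assumptions added for illustration, not part of Harbor. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: repeat the benchmark command for several models.
# Assumptions (not from Harbor itself): run_benchmark, DRY_RUN, and the
# second model name in the loop are hypothetical and for illustration only.
set -u

run_benchmark() {
  model="$1"
  concurrency="${2:-4}"   # defaults to 4, matching the example above

  # Rebuild the positional parameters as the full command line.
  set -- harbor run \
    --dataset terminal-bench@2.0 \
    --agent claude-code \
    --model "$model" \
    --n-concurrent "$concurrency"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run: print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}

# The second entry is an illustrative placeholder; substitute any
# LiteLLM-supported model name that the chosen agent can use.
for m in anthropic/claude-opus-4-1 anthropic/claude-sonnet-4-5; do
  run_benchmark "$m" 4
done
```

Running the script with `DRY_RUN=0` set in the environment would execute each command instead of printing it.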

Key features:

Evaluate agents such as Claude Code, OpenHands, and Codex CLI
Build and share benchmarks and environments
Run experiments in parallel across cloud providers (Daytona, Modal)
Generate rollouts for RL optimization