Agent Eval

Low risk

Author: @affaan-m (verified source)

4.6 · 264 installs · v1.0.0 · Updated May 25, 2026

Usage

Run the following commands in Claude Code.

Step 1: Add the marketplace

/plugin marketplace add affaan-m/ECC

Step 2: Install the plugin

/plugin install agent-eval@ecc

About

Head-to-head evaluation of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, reporting pass rate, cost, time, and consistency.

name: agent-eval
description: "Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics"
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob

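Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Most "which coding agent is best?" comparisons run on gut feeling; this tool makes them systematic.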

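When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team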

Installation

Note: Review the source code first, then install agent-eval from its repository.

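Core Concepts

YAML Task Definitions

Tasks are defined declaratively. Each task specifies what to do, which files are involved, and how success is judged: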

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
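To make the fields above concrete, here is a minimal Python sketch of how a task definition like this might be checked once it has been parsed. It assumes the YAML is already loaded into a dict (for example with PyYAML's `yaml.safe_load`). The `Task` dataclass, `REQUIRED_FIELDS`, `parse_task`, and the `HEAD` default are hypothetical names and choices for illustration, not agent-eval's actual schema.

```python
from dataclasses import dataclass, field

# Assumption: which fields are mandatory is inferred from the template above.
REQUIRED_FIELDS = ("name", "repo", "prompt", "judge")


@dataclass
class Task:
    name: str
    repo: str
    prompt: str
    judge: list
    description: str = ""
    files: list = field(default_factory=list)
    commit: str = "HEAD"  # assumed fallback when no commit is pinned


def parse_task(raw: dict) -> Task:
    """Validate a parsed task dict and convert it into a Task."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required fields: {', '.join(missing)}")
    if not raw["judge"]:
        raise ValueError("task needs at least one judge")
    return Task(
        name=raw["name"],
        repo=raw["repo"],
        prompt=raw["prompt"],
        judge=list(raw["judge"]),
        description=raw.get("description", ""),
        files=list(raw.get("files", [])),
        commit=str(raw.get("commit", "HEAD")),
    )
```

Failing early on a missing `judge` or `prompt` keeps a malformed task from silently producing a meaningless result.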

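Git Worktree Isolation

Every agent run gets its own git worktree, so no Docker is needed. Each run starts from the same reproducible state, and agents cannot interfere with one another or corrupt the base repository.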
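The isolation described here can be sketched with plain `git worktree` commands driven from Python. This is an illustration under stated assumptions, not agent-eval's implementation: the function names are hypothetical, and using `--detach` at a pinned commit is one reasonable way to do it.

```python
import subprocess
from pathlib import Path


def worktree_add_command(path: Path, commit: str) -> list[str]:
    """Build the git command that checks out `commit` into a new detached worktree."""
    return ["git", "worktree", "add", "--detach", str(path), commit]


def worktree_remove_command(path: Path) -> list[str]:
    """Build the git command that deletes a worktree, even if it has local changes."""
    return ["git", "worktree", "remove", "--force", str(path)]


def create_worktree(repo: Path, path: Path, commit: str) -> Path:
    """Create an isolated checkout of `repo` at `commit`; raises if git fails."""
    subprocess.run(worktree_add_command(path, commit), cwd=repo, check=True)
    return path


def remove_worktree(repo: Path, path: Path) -> None:
    """Discard the worktree once the run has been judged."""
    subprocess.run(worktree_remove_command(path), cwd=repo, check=True)
```

Because the worktree is detached, whatever the agent commits or edits there never moves a branch in the base repository.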

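Metrics Collected

| Metric | What it measures |
|--------|------------------|
| Pass rate | Did the agent's code pass the judges? |
| Cost | API spend per task (when available) |
| Time | Wall-clock time to completion, in seconds |
| Consistency | Pass rate across repeated runs (e.g. 3/3 = 100%) |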
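As a rough illustration of how these four metrics could be derived from raw run records, consider the sketch below. `RunResult` and `summarize` are hypothetical names, and averaging cost and time across runs is an assumption; the source does not say how agent-eval aggregates them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency": passes / len(runs),  # fraction of runs that passed
        "mean_cost_usd": mean(costs) if costs else None,  # assumed: average over runs
        "mean_seconds": mean(r.seconds for r in runs),
    }
```

Cost is kept optional because the table notes it is only recorded when available.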

Workflow

1. Define Tasks

Create a tasks/ directory with one YAML file per task:

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Run the agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judges
Records pass/fail, cost, and time
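The steps of a single run can be sketched as one small function. This is a simplified illustration, not the tool's code: the `Agent` and `Judge` callables are hypothetical interfaces, the worktree is assumed to have been created by the caller, and treating a run as passed only when every judge passes is an assumption.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Hypothetical interfaces: an agent takes (prompt, workdir) and returns its
# API cost if known; a judge takes the workdir and returns True on success.
Agent = Callable[[str, Path], Optional[float]]
Judge = Callable[[Path], bool]


@dataclass
class RunRecord:
    passed: bool
    cost_usd: Optional[float]
    seconds: float


def run_once(agent: Agent, prompt: str, workdir: Path, judges: list[Judge]) -> RunRecord:
    """Hand the prompt to the agent, time it, then apply every judge."""
    start = time.monotonic()
    cost = agent(prompt, workdir)
    seconds = time.monotonic() - start  # only the agent's work is timed
    passed = all(judge(workdir) for judge in judges)  # assumed: all judges must pass
    return RunRecord(passed=passed, cost_usd=cost, seconds=seconds)
```

Timing stops before the judges run, so a slow test suite is not charged to the agent.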

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (Deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
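A deterministic judge of this kind boils down to running a command in the agent's worktree and checking its exit status. The sketch below assumes that exit code 0 means pass; `command_judge` and its `timeout` parameter are hypothetical, not part of agent-eval's documented interface.

```python
import subprocess
from pathlib import Path


def command_judge(command: list[str], workdir: Path, timeout: float = 600.0) -> bool:
    """Run `command` inside `workdir`; pass only if it exits with status 0."""
    try:
        result = subprocess.run(
            command,
            cwd=workdir,
            capture_output=True,  # keep test output off the console
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing command counts as a failed judgement.
        return False
    return result.returncode == 0
```

For example, `command_judge(["pytest", "tests/", "-v"], Path("./my-project"))` would mirror the pytest entry above.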

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
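A pattern judge can be approximated with the standard library alone. In this sketch the pattern is treated as a Python regular expression matched line by line, and the judge passes if any file under the glob contains a match; both choices are assumptions, and `grep_judge` is a hypothetical name.

```python
import re
from pathlib import Path


def grep_judge(pattern: str, files_glob: str, workdir: Path) -> bool:
    """Pass if any line of any file matching `files_glob` matches `pattern`."""
    regex = re.compile(pattern)
    for path in sorted(workdir.glob(files_glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the judge from crashing on odd encodings
        text = path.read_text(encoding="utf-8", errors="replace")
        if any(regex.search(line) for line in text.splitlines()):
            return True
    return False
```

A pattern match only shows that certain text exists, not that the code works, so it is best paired with a test-based judge.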

Model-Based (LLM as Judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.

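Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run each agent at least 3 times to capture variance; agents are non-deterministic
Pin the commit in the task YAML so results stay reproducible across days and weeks
Include at least one deterministic judge (tests, build) per task; LLM judges add noise
Track cost alongside pass rate; an agent with a 95% pass rate at 10x the cost may not be the right choice
Version your task definitions; they are test fixtures, so treat them like code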
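One way to weigh cost against pass rate is the expected cost of getting a passing result. The sketch below assumes you simply rerun the agent until it passes and that runs are independent, so the expected number of attempts is 1 / pass rate. `cost_per_success` is a hypothetical helper, and the figures are taken from the sample report above.

```python
def cost_per_success(cost_per_run: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing run, assuming independent retries."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run / pass_rate


# Figures from the sample report: claude-code $0.12 at 3/3, aider $0.08 at 2/3.
claude_code = cost_per_success(0.12, 3 / 3)
aider = cost_per_success(0.08, 2 / 3)
print(f"claude-code: ${claude_code:.2f} per passing run")
print(f"aider:       ${aider:.2f} per passing run")
```

With these numbers both agents land at roughly $0.12 per passing run, which shows how a cheaper per-run price can be cancelled out by a lower pass rate. Three runs is a very small sample, so treat such a figure as indicative only.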

Links

Repository: github.com/joaquinhuigomez/agent-eval

Compatible Tools

Claude Code, Cursor