
name: eval-harness
description: Formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob
Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
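When to Activate
- Setting up eval-driven development (EDD) for an AI-assisted workflow
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions
Philosophy
Eval-driven development treats evals as the "unit tests of AI development":
- Define expected behavior before implementing
- Run evals continuously during development
- Track regressions on every change
- Measure reliability with pass@k metrics
Eval Types
Capability Evals
Test whether Claude can do something it could not do before: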
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
Regression Evals
Ensure that a change does not break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
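The `Result:` line of the regression template can be produced mechanically from the per-test lines. The following is a minimal sketch, assuming results are written one per line as `name: PASS` or `name: FAIL`; the function name `summarize_results` and the optional baseline argument are hypothetical, not part of the skill.

```shell
# summarize_results [PREVIOUS]
# Reads "name: PASS" / "name: FAIL" lines on stdin and prints a
# "Result: X/Y passed" line. PREVIOUS (e.g. "3/3") is an optional,
# caller-supplied baseline that is appended verbatim.
summarize_results() {
  awk -v prev="${1:-}" '
    /: PASS$/ { passed++; total++ }
    /: FAIL$/ { total++ }
    END {
      line = sprintf("Result: %d/%d passed", passed, total)
      if (prev != "") line = line " (previously " prev ")"
      print line
    }
  '
}
```

For example, piping three result lines with one `FAIL` into `summarize_results "3/3"` prints `Result: 2/3 passed (previously 3/3)`. Lines that end in neither `PASS` nor `FAIL` are ignored.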
Grader Types
1. Code-Based Graders
Use code for deterministic checks:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
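The `cmd && echo "PASS" || echo "FAIL"` idiom above repeats the same boilerplate for every check and discards the exit status. One possible way to factor it out is sketched below; the helper name `grade` is an assumption introduced for illustration.

```shell
# grade NAME COMMAND [ARGS...]
# Runs COMMAND quietly and prints "NAME: PASS" or "NAME: FAIL".
# Returns the command's own exit status so callers can chain on it.
grade() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: PASS"
  else
    local rc=$?
    echo "$name: FAIL"
    return "$rc"
  fi
}

# Usage with the checks shown above (paths and npm scripts are the
# document's own examples and must exist in your project):
#   grade "auth-export" grep -q "export function handleAuth" src/auth.ts
#   grade "build" npm run build
```

Because each call emits a `name: PASS` or `name: FAIL` line, the output can be appended directly to an eval log or fed to a summarizing step.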
2. Model-Based Graders
Use Claude to evaluate open-ended output:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
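To turn a model grader's reply into a pass/fail signal, the `Score:` line has to be parsed and compared against a cutoff. The sketch below assumes the reply follows the template above and contains a line such as `Score: 4`; the function name `score_passes` and any particular threshold are assumptions, since the document does not fix a passing score.

```shell
# score_passes THRESHOLD
# Reads grader output on stdin, takes the first "Score: N" line (N in 1-5),
# and succeeds when N >= THRESHOLD.
# Returns 2 when no score line is found, so a malformed reply is not
# silently counted as a failing score.
score_passes() {
  local threshold="$1"
  local score
  score=$(sed -n 's/^Score:[[:space:]]*\([1-5]\).*/\1/p' | head -n 1)
  if [ -z "$score" ]; then
    return 2
  fi
  [ "$score" -ge "$threshold" ]
}
```

A reply containing `Score: 4` passes `score_passes 4` and fails `score_passes 5`. Distinguishing "no score found" (status 2) from "score too low" (status 1) lets a harness retry or escalate to human review when the grader's reply is malformed.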
3. Human Graders
Flag changes that need manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
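Metrics
pass@k
"At least one success in k attempts"
- pass@1: success rate on the first attempt
- pass@3: success within 3 attempts
- Typical target: pass@3 > 90%
pass^k
"All k trials succeed"
- A stricter reliability bar than pass@k
- pass^3: 3 consecutive successes
- Use for critical paths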
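The difference between the two metrics is easiest to see on a single eval's trial history. The following is a small sketch using the definitions given here (at least one success in the first k trials versus all of the first k succeeding); the function names `pass_at_k` and `pass_hat_k` are hypothetical, and outcomes are assumed to be recorded as the literal strings `PASS` or `FAIL` in trial order.

```shell
# pass_at_k K OUTCOME...
# Succeeds if at least one of the first K outcomes is PASS.
pass_at_k() {
  local k="$1" i=0 outcome
  shift
  for outcome in "$@"; do
    [ "$i" -lt "$k" ] || break
    [ "$outcome" = "PASS" ] && return 0
    i=$((i + 1))
  done
  return 1
}

# pass_hat_k K OUTCOME...
# Succeeds only if at least K outcomes were recorded and the first K
# are all PASS.
pass_hat_k() {
  local k="$1" i=0 outcome
  shift
  [ "$#" -ge "$k" ] || return 1
  for outcome in "$@"; do
    [ "$i" -lt "$k" ] || break
    [ "$outcome" = "PASS" ] || return 1
    i=$((i + 1))
  done
  return 0
}
```

With the history `FAIL FAIL PASS`, `pass_at_k 3` succeeds while `pass_hat_k 3` fails, which is why pass^k suits critical paths where a single flaky run is unacceptable.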
Eval Workflow
1. Define (before coding)
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
2. Implement
Write code that passes the defined evals.
3. Evaluate
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
4. Report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
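The `Metrics:` lines in the report follow from the per-eval annotations: an eval marked `pass@2` first succeeded on its second attempt, so it counts toward pass@3 but not pass@1. The sketch below reproduces that arithmetic; the function name `pass_rate` and the convention of passing `0` for an eval that never succeeded are assumptions made for illustration.

```shell
# pass_rate K ATTEMPT...
# Each ATTEMPT is the 1-based attempt on which one eval first passed,
# or 0 if it never passed. Prints "P% (X/Y)", where X is the number of
# evals that passed within K attempts and P is rounded to the nearest
# integer.
pass_rate() {
  local k="$1" passed=0 total=0 attempt
  shift
  for attempt in "$@"; do
    total=$((total + 1))
    if [ "$attempt" -ge 1 ] && [ "$attempt" -le "$k" ]; then
      passed=$((passed + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "0% (0/0)"
    return 1
  fi
  echo "$(( (passed * 100 + total / 2) / total ))% ($passed/$total)"
}
```

For the report above, the first-success attempts are 1, 2, and 1, so `pass_rate 1 1 2 1` prints `67% (2/3)` and `pass_rate 3 1 2 1` prints `100% (3/3)`.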
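Integration Patterns
Before implementation
/eval define feature-name
Creates the eval definition file at .claude/evals/feature-name.md
During implementation
/eval check feature-name
Runs the current evals and reports their status
After implementation
/eval report feature-name
Generates the full eval report
Eval Storage
Store evals in the project: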
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
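The layout above can be scaffolded by hand or with a small helper. The following is a sketch under stated assumptions: the function name `init_eval_store` is hypothetical, the definition file is seeded with only a heading, and `baseline.json` is created as an empty JSON object because the document does not specify its schema.

```shell
# init_eval_store FEATURE [ROOT]
# Creates .claude/evals/ under ROOT (default: current directory) with the
# three files shown above, without overwriting files that already exist.
init_eval_store() {
  local feature="$1" root="${2:-.}"
  local dir="$root/.claude/evals"
  mkdir -p "$dir" || return 1
  [ -e "$dir/$feature.md" ] || printf '## EVAL DEFINITION: %s\n' "$feature" > "$dir/$feature.md"
  [ -e "$dir/$feature.log" ] || : > "$dir/$feature.log"
  # Placeholder only: the baseline schema is an assumption, not defined here.
  [ -e "$dir/baseline.json" ] || printf '{}\n' > "$dir/baseline.json"
}
```

Running `init_eval_store feature-xyz` from the project root yields the tree shown above; committing the resulting files keeps evals versioned alongside the code.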
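Best Practices
- Define evals before coding: this forces clear thinking about success criteria
- Run evals frequently: catch regressions early
- Track pass@k over time: monitor reliability trends
- Use code graders whenever possible: deterministic > probabilistic
- Require human review for security-related changes: never fully automate security checks
- Keep evals fast: slow evals do not get run
- Version evals with the code: evals are first-class artifacts
Example: Adding Authentication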
## EVAL: add-authentication
### Phase 1: Define (10 min)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials are rejected
- [ ] Session token is generated on login
- [ ] Protected routes require authentication
Regression Evals:
- [ ] Public routes still accessible
- [ ] Existing API endpoints unchanged
- [ ] Database migrations are reversible
Compatible tools: Claude Code, Cursor
Tags: AI & Machine Learning
