
name: gan-style-harness
description: "GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper."
origin: ECC-community
tools: Read, Write, Edit, Bash, Grep, Glob, Task
GAN-Style Harness Skill
Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)
A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.
Core Insight
When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.
When to Use
- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output
When NOT to Use
- Quick single-file fixes (use standard claude -p)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use TDD workflow)
Architecture
      ┌─────────────┐
      │   PLANNER   │
      │  (Opus 4.6) │
      └──────┬──────┘
             │ Product Spec
             │ (features, sprints, design direction)
             ▼
┌────────────────────────┐
│                        │
│  GENERATOR-EVALUATOR   │
│     FEEDBACK LOOP      │
│                        │
│  ┌──────────┐          │
│  │GENERATOR │--build-->│──┐
│  │(Opus 4.6)│          │  │
│  └────▲─────┘          │  │
│       │                │  │ live app
│    feedback            │  │
│       │                │  │
│  ┌────┴─────┐          │  │
│  │EVALUATOR │<-test----│──┘
│  │(Opus 4.6)│          │
│  │+Playwright│         │
│  └──────────┘          │
│                        │
│    5-15 iterations     │
└────────────────────────┘
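The feedback loop in the diagram can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the harness's actual code: `run_harness`, the `Feedback` class, the `generate` and `evaluate` callables, and the `pass_score` threshold are all hypothetical names introduced here. Only the 5-15 iteration range comes from the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Evaluator output for one iteration (hypothetical shape)."""
    score: float                                   # overall score on a 1-10 scale
    issues: List[str] = field(default_factory=list)


def run_harness(
    spec: str,
    generate: Callable[[str, Optional[Feedback]], str],
    evaluate: Callable[[str], Feedback],
    max_iterations: int = 15,   # upper end of the 5-15 range in the diagram
    pass_score: float = 8.0,    # assumed stopping threshold, not from the source
) -> Tuple[str, List[Feedback]]:
    """Alternate generation and evaluation until the score passes or the budget runs out."""
    history: List[Feedback] = []
    feedback: Optional[Feedback] = None
    build = ""
    for _ in range(max_iterations):
        build = generate(spec, feedback)   # generator sees the previous critique, if any
        feedback = evaluate(build)         # evaluator judges the new build
        history.append(feedback)
        if feedback.score >= pass_score:
            break
    return build, history
```

In a real harness `generate` and `evaluate` would each wrap a separate agent session; keeping them as plain callables here makes the control flow easy to test with stubs.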
The Three Agents
1. Planner Agent
Role: Product manager — expands a brief prompt into a full product specification.
Key behaviors:
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately ambitious — conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later
Model: Opus 4.6 (needs deep reasoning for spec expansion)
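The Planner's output could be modeled as a small data structure like the one below. This is an illustrative sketch only: the class names, field names, and the `by_sprint` helper are hypothetical, and the source does not prescribe a spec format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    name: str
    user_story: str
    sprint: int  # 1-based sprint number


@dataclass
class ProductSpec:
    prompt: str                                   # the original one-line prompt
    design_direction: str                         # visual design direction
    features: List[Feature] = field(default_factory=list)
    evaluation_criteria: List[str] = field(default_factory=list)

    def by_sprint(self) -> Dict[int, List[str]]:
        """Group feature names by sprint, in ascending sprint order."""
        grouped: Dict[int, List[str]] = {}
        # sorted() is stable, so features within a sprint keep their input order
        for feature in sorted(self.features, key=lambda f: f.sprint):
            grouped.setdefault(feature.sprint, []).append(feature.name)
        return grouped
```

A structure like this gives the Generator a sprint-by-sprint work list and hands the Evaluator the criteria the Planner defined.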
2. Generator Agent
Role: Developer — implements features according to the spec.
Key behaviors:
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in the next iteration
Model: Opus 4.6 (needs strong coding capability)
3. Evaluator Agent
Role: QA engineer — tests the live running application, not just code.
Key behaviors:
- Uses Playwright MCP to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
  - Design Quality: Does it feel like a coherent whole?
  - Originality: Custom decisions vs. template/AI patterns?
  - Craft: Typography, spacing, animations, micro-interactions?
  - Functionality: Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be ruthlessly strict — never praises mediocre work
Model: Opus 4.6 (needs strong judgment + tool use)
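One way to represent the Evaluator's structured feedback is sketched below. The `EvaluationReport` class, its field names, and the `weighted_score` helper are hypothetical. The weights for Design Quality (0.3) and Originality (0.2) are taken from the rubric in the next section; the weights for Craft and Functionality are placeholders assumed here so that the total is 1.0.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The first two weights come from the rubric; the last two are assumed placeholders.
WEIGHTS: Dict[str, float] = {
    "design_quality": 0.3,
    "originality": 0.2,
    "craft": 0.2,          # assumption, not from the source
    "functionality": 0.3,  # assumption, not from the source
}


@dataclass
class EvaluationReport:
    scores: Dict[str, int]                          # one 1-10 score per criterion
    issues: List[str] = field(default_factory=list) # specific problems found

    def __post_init__(self) -> None:
        missing = set(WEIGHTS) - set(self.scores)
        if missing:
            raise ValueError(f"missing scores for: {sorted(missing)}")
        for name, value in self.scores.items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be scored 1-10, got {value}")

    def weighted_score(self) -> float:
        """Combine the per-criterion scores into one number using WEIGHTS."""
        return sum(WEIGHTS[name] * self.scores[name] for name in WEIGHTS)
```

Validating the score range at construction time means a malformed Evaluator response fails loudly instead of silently skewing the loop's stopping decision.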
Evaluation Criteria
The default four criteria, each scored 1-10:
## Evaluation Rubric
### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work
### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no
