70% of your AI agent's performance lives outside the model. HarnessKit tests that 70%: the scaffold, the tools, the prompts, the glue. Baseline it. Benchmark it. Catch regressions before production.
Define your agent's scaffold configuration as a versioned, testable artifact. System prompts, tool configs, routing logic, guardrails. All of it.
Run the same task suite across different harness versions. Compare quality scores. See exactly what changed and what broke.
Capture every scaffold decision as config: system prompts, tool schemas, routing rules, retry logic, guardrails. Version it. Diff it. Review it like code.
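What that might look like in practice. A minimal, self-contained sketch using plain dataclasses; every class name and field below is an illustrative assumption, not HarnessKit's actual schema:

```python
# Scaffold-as-config sketch. All names here are illustrative assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ToolConfig:
    name: str
    schema_path: str  # JSON schema describing the tool's arguments

@dataclass
class ScaffoldConfig:
    version: str
    system_prompt_path: str
    tools: list[ToolConfig] = field(default_factory=list)
    routing: dict[str, str] = field(default_factory=dict)  # intent -> flow
    max_retries: int = 3
    guardrails: list[str] = field(default_factory=list)

scaffold = ScaffoldConfig(
    version="2024.06.1",
    system_prompt_path="prompts/support_agent.md",
    tools=[ToolConfig("search_orders", "schemas/search_orders.json")],
    routing={"billing": "refund_flow", "default": "triage_flow"},
    guardrails=["no_pii_in_responses"],
)

# Serialize to an artifact you can commit, diff, and review like code.
print(json.dumps(asdict(scaffold), indent=2))
```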
Execute the same task suite against multiple harness versions simultaneously. Statistical comparison, not vibes. Know which scaffold change improved quality and which introduced regressions.
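One way to get from scores to statistics: a paired bootstrap over per-task quality deltas between two harness versions. The scores below are placeholder data; in a real run they would come from executing your task suite:

```python
# Paired bootstrap sketch: "statistical comparison, not vibes."
# Placeholder per-task quality scores for two harness versions.
import random

baseline  = [0.72, 0.55, 0.80, 0.61, 0.90, 0.48, 0.77, 0.66, 0.59, 0.83]
candidate = [0.78, 0.60, 0.79, 0.70, 0.92, 0.55, 0.81, 0.64, 0.68, 0.88]

# Per-task deltas: positive means the candidate harness did better.
deltas = [c - b for b, c in zip(baseline, candidate)]

def bootstrap_ci(samples, iters=10_000, alpha=0.05):
    """Percentile confidence interval for the mean via resampling."""
    means = []
    for _ in range(iters):
        resample = random.choices(samples, k=len(samples))
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

mean_delta = sum(deltas) / len(deltas)
lo, hi = bootstrap_ci(deltas)
print(f"mean quality delta: {mean_delta:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# If the interval excludes zero, the scaffold change moved quality, not noise.
```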
Continuous baseline monitoring in CI. Every harness change gets a quality score delta. Regressions surface before they reach production, not after your users file bug reports.
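A sketch of what that gate could look like as a CI step. The baseline value, regression threshold, and `run_suite` stub are illustrative assumptions; in a real pipeline the baseline would be loaded from a committed file and the stub replaced by an actual suite run:

```python
# CI quality-gate sketch: fail the build on a quality regression.
import sys

BASELINE_QUALITY = 0.660  # committed baseline score (illustrative)
MAX_REGRESSION = 0.020    # fail if mean quality drops by more than this

def run_suite() -> float:
    """Placeholder: run the task suite against the current harness build
    and return its mean quality score. Stubbed so the sketch executes."""
    return 0.655

def main() -> int:
    current = run_suite()
    delta = current - BASELINE_QUALITY
    print(f"quality {current:.3f} vs baseline {BASELINE_QUALITY:.3f} "
          f"(delta {delta:+.3f})")
    if delta < -MAX_REGRESSION:
        print("regression exceeds threshold; failing the build")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```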
When an agent fails, was it the model, the prompt, the tool config, or the routing? HarnessKit isolates variables so you debug the right layer.
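A sketch of the idea: one-factor-at-a-time ablation. Swap a single layer per run, hold everything else fixed, and see which swap moves the score. The `evaluate` stub, config keys, and version names are all hypothetical:

```python
# Variable-isolation sketch: change exactly one harness layer per run.
import random

baseline   = {"model": "model-a", "prompt": "v3", "tools": "strict", "routing": "v2"}
candidates = {"model": "model-b", "prompt": "v4", "tools": "lenient", "routing": "v3"}

def evaluate(config: dict) -> float:
    """Placeholder: run the task suite under `config` and return mean
    quality. Seeded stub so the sketch runs deterministically."""
    random.seed(str(sorted(config.items())))
    return round(random.uniform(0.50, 0.70), 3)

base_score = evaluate(baseline)
print(f"baseline quality: {base_score:.3f}")
for layer, alt in candidates.items():
    trial = {**baseline, layer: alt}  # swap only this one layer
    delta = evaluate(trial) - base_score
    print(f"swap only {layer!r} -> {alt!r}: delta {delta:+.3f}")
# Whichever single swap reproduces the drop points at the layer to debug.
```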
Same model. Better harness. That's the difference between 52% and 66% task success.
The harness is the product. Test it like one.