vitaliikapliuk/modelharness

61 stars · Last commit 2026-06-14

Make every model cheaper or better. Zero-config behavioral harness for Claude Code, with a reproducible 408-run benchmark

PluginTDD & Testing Debugging Documentation DevOps & Infra Productivity AI & Prompting Data & ML

README preview

# modelharness

**Make every model cheaper or better. Measured on four Claude models — none got worse.**

![benchmark: 408 runs · 100% pass](https://img.shields.io/badge/benchmark-408_runs_·_100%25_pass-2da44e)
![4/4 models improved · none worse](https://img.shields.io/badge/4%2F4_models_improved-none_worse-8250df)
![license: MIT](https://img.shields.io/badge/license-MIT-57606a)

**What it actually is:** a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:

| | |
|---|---|
| ✅ **Grounded progress** — only claims backed by a tool result; "tests fail" said plainly | ⚡ **Act, don't overplan** — enough information means act, not narrate options |
| 🎯 **Autonomy calibration** — decides minor things itself, asks only on scope or destructive actions | 🔍 **Self-verification loops** — a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done" |
| 🔀 **Delegation triggers** — explicit rules for when to fan work out to subagents | 📝 **Cross-session memory** — writes lessons and plans to files, so the next session can pick up the work |

## What the 17 tasks test

| | | |
|---|---|---|

View full repository on GitHub →