heliohq/ship
56 stars · Last commit 2026-04-22
An agentic development harness for Claude Code, Codex & Cursor: gated pipeline from spec to green checks.
README preview
# Ship: AI-Powered Software Development Harness
Ship orchestrates end-to-end software development (planning, implementation, E2E tests, review, QA, simplification, and PR creation) with a quality gate at every transition between phases.

## How It Works
Ship is a harness, not a copilot. It doesn't help AI write code — it constrains AI to produce reliable results through mechanically enforced quality gates.
**The problem Ship solves:** AI coding agents are capable but unreliable. They skip tests, hallucinate about code they haven't read, review their own work and call it good, and declare victory without evidence. Ship makes these failure modes structurally impossible.
- **Every phase is an isolated subagent.** The reviewer has never seen the implementation context. The QA evaluator can only see the spec, the diff, and the running application. Fresh context per phase means no accumulated bias.
- **Plans are adversarially tested.** An independent peer challenger produces code-grounded objections with file paths and snippets. The planner must respond with evidence, not hand-waving. This exchange runs for two rounds before the plan is shown to you.
- **Evidence is hierarchical.** L1 (screenshot, curl response, console log) is the only acceptable proof. L2 (HTTP 200, "tests passed") is insufficient. L3 ("should work based on the code") is an automatic FAIL.
- **State lives on disk, not in memory.** The current phase is tracked in a local state file. On resume, the orchestrator reads it and picks up where it left off. A stop-gate hook blocks session exit while the pipeline is active.
- **The finish line is checks green, not PR created.** After opening the PR, Ship enters a fix loop — read CI failures, dispatch fixes, address review comments, resolve merge conflicts — up to 3 rounds before escalating.
- **Test-driven implementation.** Stories follow a RED-GREEN-REFACTOR cycle with per-story code review before merge.
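
The evidence hierarchy described above can be read as a small grading function. The following is a minimal sketch, not Ship's actual code: the names `EvidenceLevel`, `grade`, and `gate` are hypothetical, and the `"INSUFFICIENT"` label is an assumption, since the README only says that L2 evidence is insufficient and that L3 is an automatic FAIL.

```python
from enum import Enum


class EvidenceLevel(Enum):
    L1 = 1  # direct artifact: screenshot, curl response, console log
    L2 = 2  # indirect signal: HTTP 200, "tests passed"
    L3 = 3  # speculation: "should work based on the code"


def grade(level: EvidenceLevel) -> str:
    """Map an evidence level to a gate outcome (outcome labels are assumed)."""
    if level is EvidenceLevel.L1:
        return "PASS"
    if level is EvidenceLevel.L2:
        return "INSUFFICIENT"  # assumed label; the README only says "insufficient"
    return "FAIL"  # L3 is an automatic FAIL per the README


def gate(claims: list[EvidenceLevel]) -> bool:
    """Open the gate only when every claim is backed by L1 evidence."""
    # An empty claim list does not pass: no evidence is not proof.
    return bool(claims) and all(grade(c) == "PASS" for c in claims)
```

Under this reading, a single L2 or L3 claim is enough to keep a gate closed, which matches the statement that L1 is the only acceptable proof.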
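
Keeping state on disk and blocking exit while the pipeline is active could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the JSON layout, the phase identifiers (loosely derived from the phases listed in the README), the `"done"` marker, and the function names. The README does not document Ship's real state file format or hook interface.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical phase identifiers, loosely following the README's phase list.
PHASES = ["plan", "implement", "e2e", "review", "qa", "simplify", "pr"]


def save_phase(state_file: Path, phase: str) -> None:
    """Write the current phase atomically so a crash cannot leave a torn file."""
    if phase not in PHASES and phase != "done":
        raise ValueError(f"unknown phase: {phase}")
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=str(state_file.parent))
    with os.fdopen(fd, "w") as fh:
        json.dump({"phase": phase}, fh)
    os.replace(tmp, state_file)


def load_phase(state_file: Path) -> str:
    """Return the saved phase, or the first phase when no state exists yet."""
    if not state_file.exists():
        return PHASES[0]
    return json.loads(state_file.read_text())["phase"]


def may_exit(state_file: Path) -> bool:
    """Stop-gate: allow session exit only when no pipeline is active."""
    return not state_file.exists() or load_phase(state_file) == "done"
```

On resume, an orchestrator built this way would call `load_phase` and continue from the returned phase instead of relying on anything held in the previous session's memory.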
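
The bounded fix loop that follows PR creation can be sketched as a retry with a round budget. Only the limit of 3 rounds comes from the README. The callables `get_failures` and `dispatch_fix`, the return strings, and the final re-check after the last fix are hypothetical choices made for this illustration.

```python
from typing import Callable, List

MAX_ROUNDS = 3  # from the README: up to 3 rounds before escalating


def fix_loop(
    get_failures: Callable[[], List[str]],
    dispatch_fix: Callable[[List[str]], None],
    max_rounds: int = MAX_ROUNDS,
) -> str:
    """Retry until checks are green or the round budget runs out."""
    for _ in range(max_rounds):
        failures = get_failures()
        if not failures:
            return "green"
        dispatch_fix(failures)
    # Re-check once after the final fix before giving up (assumed behavior).
    return "green" if not get_failures() else "escalate"
```

The point of the budget is that the loop ends in one of two explicit states, green checks or escalation to a human, and never in an unverified "PR created" state.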