Human overview · for understanding
Why its green lied — and the plan to fix it (Instance 1 · plan-explore) · 2026-06-22
Master summary — the gist in 30 seconds
Input: a live web app + a hand-written 'expected behaviour' map. Output: a report whose green you can actually trust, because it asserts state changes (an email sent, a card moved) instead of 'the page looks fine'.
flowchart LR
A["Drive the app"] --> B["Screenshot"]
B --> C{"Looks right?"}
C -->|yes| D["GREEN"]
D -.->|but the email never sent,\nthe card never moved| E["Real bug missed"]
C -.->|the new way| F["Also check the DATA:\ndid the state actually change?"]
F --> G["Trustworthy green"]
Input: a click. Output today: 'the screen rendered'. What's missing: 'and the stage moved / the draft was created / nothing was sent to a real lead'.
flowchart TD
Click["User clicks Send"] --> Pixels["Pixels: modal looks fine"]
Click --> Truth["Truth: stage did NOT move"]
Pixels --> Judge["Old judge sees only this"]
Truth -.->|invisible| Judge
Judge --> Green["Says GREEN"]
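As a rough illustration of what "also check the data" could mean for a single click, the sketch below compares a snapshot taken before the step with one taken after it. Everything here is an assumption made for illustration: the `StepExpectation` class, its field names, and the snapshot keys (`stage`, `drafts`, `emails_sent_to_real_leads`) are hypothetical and not taken from the real tool.

```python
from dataclasses import dataclass, field


@dataclass
class StepExpectation:
    """Hypothetical expectation for one step (names are illustrative)."""
    # key -> value the data must hold AFTER the step, having changed to get there
    must_change: dict = field(default_factory=dict)
    # keys whose value must be identical before and after the step
    must_not_change: list = field(default_factory=list)


def check_step(before: dict, after: dict, exp: StepExpectation) -> list:
    """Return readable failures; an empty list means the data agrees."""
    failures = []
    for key, wanted in exp.must_change.items():
        if after.get(key) == before.get(key):
            failures.append(f"{key}: did not change (still {before.get(key)!r})")
        elif after.get(key) != wanted:
            failures.append(f"{key}: expected {wanted!r}, got {after.get(key)!r}")
    for key in exp.must_not_change:
        if after.get(key) != before.get(key):
            failures.append(f"{key}: changed unexpectedly to {after.get(key)!r}")
    return failures


# Assumed example: the modal looked fine, a draft was created,
# nothing reached a real lead, but the stage never moved.
expectation = StepExpectation(
    must_change={"stage": "contacted", "drafts": 1},
    must_not_change=["emails_sent_to_real_leads"],
)
before = {"stage": "new", "drafts": 0, "emails_sent_to_real_leads": 0}
after = {"stage": "new", "drafts": 1, "emails_sent_to_real_leads": 0}
failures = check_step(before, after, expectation)
print(failures)  # ["stage: did not change (still 'new')"]
```

A screenshot-only judge would pass this step; the data check reports the one thing the pixels cannot show.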
Input: the owner's hand-test report. Output: 8 named bugs across seven kinds of blind spot (transition, side effect, transient, data semantic, binding, cascade, latency) that the new tool must catch forever.
mindmap
  root((8 manual bugs))
    Transition
      BUG06 card wont move stage
    Transient
      BUG01 no loading state
      BUG05 no send feedback
    Side effect
      BUG03 booking missing from history
    Data semantic
      BUG07 outgoing mail logged as incoming
    Binding
      BUG02 dashed border stays
    Surface detail
      BUG04 strike-through missing
    Latency
      BUG08 board lag
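One way to make "must catch forever" checkable is to keep the eight bugs as data and compare them against whatever a run actually flagged. The sketch below is only an assumption about how that might look: the bug IDs, categories, and descriptions come from the mindmap, while `BLIND_SPOTS`, `uncaught`, and `by_category` are hypothetical names.

```python
from collections import defaultdict

# Bug IDs, categories, and descriptions taken from the mindmap.
BLIND_SPOTS = {
    "BUG01": ("Transient", "no loading state"),
    "BUG02": ("Binding", "dashed border stays"),
    "BUG03": ("Side effect", "booking missing from history"),
    "BUG04": ("Surface detail", "strike-through missing"),
    "BUG05": ("Transient", "no send feedback"),
    "BUG06": ("Transition", "card won't move stage"),
    "BUG07": ("Data semantic", "outgoing mail logged as incoming"),
    "BUG08": ("Latency", "board lag"),
}


def uncaught(flagged: set) -> list:
    """Known bugs that a run failed to flag, in ID order."""
    return sorted(set(BLIND_SPOTS) - flagged)


def by_category() -> dict:
    """Group bug IDs under their blind-spot category."""
    groups = defaultdict(list)
    for bug_id, (category, _description) in sorted(BLIND_SPOTS.items()):
        groups[category].append(bug_id)
    return dict(groups)


# Assumed example: a run that only flagged two of the eight.
missing = uncaught({"BUG01", "BUG06"})
print(missing)
print(by_category()["Transient"])
```

A regression suite could fail whenever `uncaught(...)` returns a non-empty list for the golden journey.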
Input: the failure analysis from 3 reviewers who replayed the real run. Output: three groups of fixes that together make the green mean something.
flowchart LR
P1["A · Resilience\ncrash-safe + settle-aware"] --> Sum["Trustworthy tool"]
P2["B · Oracle\nmap behaviour, assert state deltas"] --> Sum
P3["C · Trust\ncoverage ledger + human sign-off"] --> Sum
Input: a frozen, human-signed list of per-step expectations (surface + state + side-effects + the immediate click feedback). Output: a verdict that is green only if pixels AND data AND interaction all agree.
flowchart TD
Map["Map expected behaviour"] --> Sign["Human signs the oracle"]
Sign --> Drive["Drive the step"]
Drive --> Pre["Read data BEFORE"]
Drive --> Post["Read data AFTER"]
Pre --> Delta{"Did the required\nstate change happen?"}
Post --> Delta
Delta -->|yes + pixels ok| Green["GREEN"]
Delta -->|no| Miss["STATE_MISSING"]
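The decision in this diagram can be sketched as a small function. `GREEN` and `STATE_MISSING` are the verdicts named above; the two other verdict names, the `judge` signature, and the idea of passing the pixel and click-feedback results in as booleans are assumptions made for illustration, since the diagram does not say what happens when the data changed but the surface or the feedback is wrong.

```python
from enum import Enum


class Verdict(Enum):
    GREEN = "GREEN"                        # from the diagram
    STATE_MISSING = "STATE_MISSING"        # from the diagram
    SURFACE_MISMATCH = "SURFACE_MISMATCH"  # hypothetical name
    FEEDBACK_MISSING = "FEEDBACK_MISSING"  # hypothetical name


def judge(before: dict, after: dict, required: dict,
          pixels_ok: bool, feedback_ok: bool) -> Verdict:
    """Green only if data, pixels, and interaction feedback all agree."""
    # Every required key must have CHANGED to the signed-off value.
    delta_happened = all(
        before.get(key) != value and after.get(key) == value
        for key, value in required.items()
    )
    if not delta_happened:
        return Verdict.STATE_MISSING
    if not pixels_ok:
        return Verdict.SURFACE_MISMATCH
    if not feedback_ok:
        return Verdict.FEEDBACK_MISSING
    return Verdict.GREEN


before = {"stage": "new"}
required = {"stage": "contacted"}
stuck = judge(before, {"stage": "new"}, required, True, True)
moved = judge(before, {"stage": "contacted"}, required, True, True)
print(stuck.value, moved.value)  # STATE_MISSING GREEN
```

Checking the data first means a good-looking screen can never mask a missing state change, which is the failure the old judge had.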
Input: the full 16-fix spec. Output IN: crash-safe driver, contradiction gate, settle/UNSETTLED, the oracle + data-probe + new verdicts, the coverage ledger, lightweight human sign-off. Output DEFERRED: structured-Gemini, docs, resume, reconstruct, locator-map.
flowchart LR
subgraph NOW["Build now"]
F1["Crash-safe driver"]
F2["Contradiction gate"]
F10["Settle / UNSETTLED"]
F12["Oracle + data-probe"]
F15["Coverage ledger"]
F16["Human sign-off"]
end
subgraph LATER["Phase 2"]
D["Docs, resume,\nreconstruct, locator-map"]
end
NOW --> Proof["1 live golden-journey run\nproves it catches the 8 bugs"]
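The source names a "Settle / UNSETTLED" fix without defining it. One plausible reading, offered purely as an assumption, is that the driver polls a probe until the value stops changing and reports `UNSETTLED` instead of judging a page that is still moving. The function name, parameters, and the `SETTLED` label below are hypothetical.

```python
import time
from typing import Any, Callable, Tuple


def wait_until_settled(probe: Callable[[], Any],
                       stable_reads: int = 3,
                       timeout_s: float = 5.0,
                       interval_s: float = 0.1) -> Tuple[str, Any]:
    """Poll `probe` until it returns the same value `stable_reads` times in a row.

    Returns ("SETTLED", value) on success, or ("UNSETTLED", last_value)
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    last = probe()
    streak = 1
    while streak < stable_reads:
        if time.monotonic() >= deadline:
            return "UNSETTLED", last
        time.sleep(interval_s)
        current = probe()
        if current == last:
            streak += 1
        else:
            # Value moved again: restart the stability count.
            last, streak = current, 1
    return "SETTLED", last


# Assumed example: a board that shows a spinner once, then stays put.
reads = iter(["loading", "3 cards", "3 cards", "3 cards"])
status, value = wait_until_settled(lambda: next(reads), interval_s=0.0)
print(status, value)  # SETTLED 3 cards
```

Returning a distinct status rather than raising keeps a slow page from being reported as either a pass or a crash.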
Input: this locked intent. Output: a fresh chat runs /plan-techspec to turn it into an exact checklist, then a third runs the build test-first and proves it live.
flowchart LR
I1["Instance 1\nplan-explore (done)"] --> I2["Instance 2\nplan-techspec -> checklist"]
I2 --> I3["Instance 3\nbuild test-first"]
I3 --> Live["Live golden-journey run\nmust flag all 8 bugs"]
Live --> Trust["Trustworthy green"]