Making visual-qa-ultra trustworthy

Why its green lied — and the plan to fix it (Instance 1 · plan-explore)

Master summary — the gist in 30 seconds

TL;DR: The automated tester reported mostly green, but a human found 8 real bugs by hand. The fix: stop grading screenshots and start grading BEHAVIOUR. Map what should happen first, then check the data, not just the pixels.

Input: a live web app + a hand-written 'expected behaviour' map. Output: a report whose green you can actually trust, because it asserts state changes (an email sent, a card moved) instead of 'the page looks fine'.

Why this matters: A screenshot can show that a screen looks right while the thing that should have happened behind it never did. That blind spot is exactly what let the tool pass a broken funnel, and what cost it the owner's trust.

flowchart LR
  A["Drive the app"] --> B["Screenshot"]
  B --> C{"Looks right?"}
  C -->|yes| D["GREEN"]
  D -.->|but the email never sent,\nthe card never moved| E["Real bug missed"]
  C -.->|the new way| F["Also check the DATA:\ndid the state actually change?"]
  F --> G["Trustworthy green"]

1 · The trust problem

TL;DR: The tool judges surfaces, not outcomes, so a non-event is invisible to it.

Input: a click. Output today: 'the screen rendered'. What's missing: 'and the stage moved / the draft was created / nothing was sent to a real lead'.

Why it matters: There is no screenshot of a thing that didn't happen. If the only evidence is pixels, every 'should-have-happened-and-didn't' bug slips through as green.

flowchart TD
  Click["User clicks Send"] --> Pixels["Pixels: modal looks fine"]
  Click --> Truth["Truth: stage did NOT move"]
  Pixels --> Judge["Old judge sees only this"]
  Truth -.->|invisible| Judge
  Judge --> Green["Says GREEN"]
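
The blind spot fits in a few lines. The sketch below is illustrative only: `StepEvidence`, `pixel_judge`, `outcome_judge` and the stage name are hypothetical, not taken from visual-qa-ultra's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepEvidence:
    """What one driven step leaves behind (hypothetical shape)."""
    screen_looks_right: bool  # what a screenshot can show
    stage_before: str         # data read before the click
    stage_after: str          # data read after the click


def pixel_judge(ev: StepEvidence) -> str:
    # Old behaviour: only the surface is consulted.
    return "GREEN" if ev.screen_looks_right else "RED"


def outcome_judge(ev: StepEvidence) -> str:
    # New behaviour: the surface AND the state change must both hold.
    moved = ev.stage_before != ev.stage_after
    return "GREEN" if ev.screen_looks_right and moved else "RED"


# The modal rendered fine, but the card never left its stage.
silent_failure = StepEvidence(True, "Contacted", "Contacted")
print(pixel_judge(silent_failure))    # GREEN
print(outcome_judge(silent_failure))  # RED
```

The same evidence yields opposite verdicts, because only the second judge is given the before/after data.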

2 · The calibration set — what the human caught

TL;DR: 8 manual bugs, each a behaviour that isn't on the settled screen. They become the framework's permanent acceptance bar.

Input: the owner's hand-test report. Output: 8 named blind spots in seven classes (a transition, a side-effect, a transient, a data-semantic, a binding, a cascade, a latency) that the new tool must catch forever.

Why it matters: Every miss had the same shape: real, but off-surface. Turning each into a checkable rule means the same class of bug can never silently pass again.

mindmap
  root((8 manual bugs))
    Transition
      BUG06 card wont move stage
    Transient
      BUG01 no loading state
      BUG05 no send feedback
    Side effect
      BUG03 booking missing from history
    Data semantic
      BUG07 outgoing mail logged as incoming
    Binding
      BUG02 dashed border stays
    Surface detail
      BUG04 strike-through missing
    Latency
      BUG08 board lag
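
One way to make the calibration set a permanent acceptance bar is to keep it as data and diff every run against it. This is a minimal sketch: the bug ids, classes and descriptions come from the mindmap above, while `CALIBRATION_SET` and `missed_bugs` are hypothetical names.

```python
from typing import Iterable

# Bug id -> (class, description), as listed in the mindmap.
CALIBRATION_SET = {
    "BUG01": ("transient", "no loading state"),
    "BUG02": ("binding", "dashed border stays"),
    "BUG03": ("side effect", "booking missing from history"),
    "BUG04": ("surface detail", "strike-through missing"),
    "BUG05": ("transient", "no send feedback"),
    "BUG06": ("transition", "card won't move stage"),
    "BUG07": ("data semantic", "outgoing mail logged as incoming"),
    "BUG08": ("latency", "board lag"),
}


def missed_bugs(flagged: Iterable[str]) -> list:
    """Return the calibration bugs a run failed to flag, in id order."""
    return sorted(set(CALIBRATION_SET) - set(flagged))


# A run that only caught two of the eight still has six blind spots.
print(missed_bugs(["BUG01", "BUG06"]))
```

A run meets the bar only when `missed_bugs` returns an empty list.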

3 · Three pillars of the fix

TL;DR: Resilience (never lose or misread a frame), the Oracle (assert expected behaviour), Trust (say what you did NOT verify).

Input: the failure analysis from 3 reviewers who replayed the real run. Output: three groups of fixes that together make the green mean something.

Why it matters: Plumbing fixes alone (no collisions, no fake greens) only make the tool faithfully report what it tests. The Oracle changes WHAT it tests; Trust makes its confidence honest.

flowchart LR
  P1["A · Resilience\ncrash-safe + settle-aware"] --> Sum["Trustworthy tool"]
  P2["B · Oracle\nmap behaviour, assert state deltas"] --> Sum
  P3["C · Trust\ncoverage ledger + human sign-off"] --> Sum
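
The "settle-aware" half of pillar A can be pictured as follows. This document does not define how the tool decides a frame has settled, so the sketch assumes one plausible rule: a frame counts as settled once several consecutive captures are byte-identical, and the step is reported as UNSETTLED otherwise. `wait_until_settled`, `capture` and both thresholds are hypothetical.

```python
import hashlib
from typing import Callable


def wait_until_settled(
    capture: Callable[[], bytes],
    stable_reads: int = 2,
    max_reads: int = 10,
) -> str:
    """Return "SETTLED" once `stable_reads` consecutive captures match,
    or "UNSETTLED" if that never happens within `max_reads` captures."""
    last_digest = None
    streak = 0
    for _ in range(max_reads):
        digest = hashlib.sha256(capture()).hexdigest()
        # Extend the streak on a repeat frame, restart it on a new one.
        streak = streak + 1 if digest == last_digest else 1
        last_digest = digest
        if streak >= stable_reads:
            return "SETTLED"
    return "UNSETTLED"


# Assumed stand-in for a screenshot source: two changing frames, then stable.
frames = iter([b"spinner", b"half-rendered", b"final", b"final"])
print(wait_until_settled(lambda: next(frames)))
```

A real driver would pause between captures; the pause is left out here to keep the example deterministic. The point is that a screen which never stops changing gets its own verdict instead of being judged on an arbitrary frame.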

4 · The Expected-Behaviour Oracle (the centrepiece)

TL;DR: Map what should happen FIRST, have a human sign it, THEN drive, and assert that the data changed, not just the pixels.

Input: a frozen, human-signed list of per-step expectations (surface + state + side-effects + the immediate click feedback). Output: a verdict that is green only if pixels AND data AND interaction all agree.

Why it matters: If the success criterion is written and signed before the run, the tester can't rationalise whatever it sees as 'expected'. A missing stage-move becomes a hard failure called STATE_MISSING, not a quiet pass.

flowchart TD
  Map["Map expected behaviour"] --> Sign["Human signs the oracle"]
  Sign --> Pre["Read data BEFORE"]
  Pre --> Drive["Drive the step"]
  Drive --> Post["Read data AFTER"]
  Pre --> Delta{"Did the required\nstate change happen?"}
  Post --> Delta
  Delta -->|yes + pixels ok| Green["GREEN"]
  Delta -->|no| Miss["STATE_MISSING"]
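
The flow above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the tool's implementation: `Expectation`, `judge_step`, the `probe`/`drive`/`pixels_ok` callables and the `SURFACE_MISMATCH` verdict are hypothetical, and the immediate click-feedback check is left out for brevity. Only `GREEN` and `STATE_MISSING` come from the plan itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class Expectation:
    step: str
    required_delta: Mapping[str, Any]  # field -> value it must hold AFTER the step
    signed_by: str                     # empty string means nobody signed it


def judge_step(
    exp: Expectation,
    probe: Callable[[], Mapping[str, Any]],  # reads the app's data
    drive: Callable[[], None],               # performs the click
    pixels_ok: Callable[[], bool],           # the existing surface check
) -> str:
    if not exp.signed_by:
        raise ValueError("oracle must be signed before the run")
    before = dict(probe())  # snapshot BEFORE driving
    drive()
    after = dict(probe())   # snapshot AFTER driving
    for name, wanted in exp.required_delta.items():
        changed = before.get(name) != after.get(name)
        if not changed or after.get(name) != wanted:
            return "STATE_MISSING"
    # Placeholder verdict name for a surface failure (assumption).
    return "GREEN" if pixels_ok() else "SURFACE_MISMATCH"


# Assumed in-memory stand-in for the app's data store.
db = {"stage": "New"}
exp = Expectation("send email", {"stage": "Contacted"}, signed_by="owner")


def move_card() -> None:
    db["stage"] = "Contacted"


broken = judge_step(exp, lambda: db, lambda: None, lambda: True)
working = judge_step(exp, lambda: db, move_card, lambda: True)
print(broken, working)  # STATE_MISSING GREEN
```

Because the expectation is frozen and signed before `drive` is called, the judge has nothing to reinterpret afterwards: either the required delta is in the data or the step fails.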

5 · Scope for this build — trust-core first

TL;DR: Build now the fixes for what broke trust; defer the polish to phase 2.

Input: the full 16-fix spec. Output IN: crash-safe driver, contradiction gate, settle/UNSETTLED, the oracle + data-probe + new verdicts, the coverage ledger, lightweight human sign-off. Output DEFERRED: structured-Gemini, docs, resume, reconstruct, locator-map.

Why it matters: Ponytail: ship the shortest path to a trustworthy tool and prove it, rather than boil the ocean. The deferred items are valuable but were not what made the green lie.

flowchart LR
  subgraph NOW["Build now"]
    F1["Crash-safe driver"]
    F2["Contradiction gate"]
    F10["Settle / UNSETTLED"]
    F12["Oracle + data-probe"]
    F15["Coverage ledger"]
    F16["Human sign-off"]
  end
  subgraph LATER["Phase 2"]
    D["Docs, resume,\nreconstruct, locator-map"]
  end
  NOW --> Proof["1 live golden-journey run\nproves it catches the 8 bugs"]
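
Of the build-now items, the coverage ledger is the easiest to picture: a record of what was planned versus what was actually verified, so the report can state its gaps. The sketch below is an assumption about its shape. `CoverageLedger` and its methods are hypothetical names, and the check labels are borrowed from the examples in section 1.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageLedger:
    """Tracks which planned checks were actually verified (hypothetical shape)."""
    planned: set
    verified: set = field(default_factory=set)

    def mark_verified(self, check: str) -> None:
        # Refuse credit for checks nobody planned, so the totals stay honest.
        if check not in self.planned:
            raise KeyError(f"unplanned check: {check}")
        self.verified.add(check)

    def unverified(self) -> list:
        return sorted(self.planned - self.verified)

    def summary(self) -> str:
        gaps = self.unverified()
        head = f"verified {len(self.verified)}/{len(self.planned)}"
        return head if not gaps else f"{head}; NOT verified: {', '.join(gaps)}"


ledger = CoverageLedger(planned={"stage moved", "draft created", "no mail to real lead"})
ledger.mark_verified("stage moved")
print(ledger.summary())
```

A green verdict that ships alongside this summary tells the reader which checks it covers and which it does not.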

6 · The planning chain + how 'done' is proven

TL;DR: plan-explore (here) → plan-techspec → execute. Done = every fix has a red→green test AND one live golden-journey run flags all 8 bugs.

Input: this locked intent. Output: a fresh chat runs /plan-techspec to turn it into an exact checklist, then a third runs the build test-first and proves it live.

Why it matters: Locking intent before building lets the next instance start from a settled scope instead of re-deriving it under time pressure, which is how the first run thrashed.

flowchart LR
  I1["Instance 1\nplan-explore (done)"] --> I2["Instance 2\nplan-techspec -> checklist"]
  I2 --> I3["Instance 3\nbuild test-first"]
  I3 --> Live["Live golden-journey run\nmust flag all 8 bugs"]
  Live --> Trust["Trustworthy green"]
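
The two-part definition of done can be written as a single gate. This is a sketch under assumptions: `FixEvidence` and `is_done` are hypothetical names, and the fix ids are examples taken from the labels in the section 5 diagram.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class FixEvidence:
    fix_id: str
    failed_before: bool  # the test was red before the fix landed
    passes_after: bool   # the same test is green after the fix


def is_done(
    fixes: Sequence[FixEvidence],
    flagged_live: Iterable[str],
    calibration_ids: Iterable[str],
) -> bool:
    # Part 1: every fix has a test that went red -> green.
    every_fix_proven = all(f.failed_before and f.passes_after for f in fixes)
    # Part 2: the live golden-journey run flagged every calibration bug.
    all_bugs_flagged = set(calibration_ids) <= set(flagged_live)
    return bool(fixes) and every_fix_proven and all_bugs_flagged


bug_ids = [f"BUG{n:02d}" for n in range(1, 9)]  # BUG01 .. BUG08
fixes = [FixEvidence("F1", True, True), FixEvidence("F12", True, True)]

print(is_done(fixes, flagged_live=bug_ids, calibration_ids=bug_ids))      # True
print(is_done(fixes, flagged_live=bug_ids[:7], calibration_ids=bug_ids))  # False
```

A test that was never red proves nothing about its fix, which is why `failed_before` is part of the gate rather than just `passes_after`.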

Intent handoff (HANDOFF.md) → Deep technical spec (HARDENING_HANDOFF_v2.md) → The 3 critiques → Calibration set (8 bugs) →