Showdown Thoughts: The Three-Pass Pattern

Model Showdown Round 5 ended with a leaderboard. Sonnet 4.6 won on the rubric. Opus 4.7 placed second. Qwen 3.5 contributed almost nothing structural. That's the measurement story.

This is the methodology story — what happened after the scores were revealed.

The Problem With Picking a Winner

The naive workflow after a bakeoff is: pick the best run, merge it to main, ship it. Winner takes all.

That's wrong, and Round 5 made it obvious why.

The winning run (Sonnet 4.6) had the best overall rubric score. It also had a weaker path validator than Opus 4.7, and its orphan-matching logic would have missed real-world cases that Opus 4.6 caught. The second-place run (Opus 4.7) had the best validator and the cleanest route structure, but the worst data source choice — reading from the build-time filesystem instead of the live GitHub Contents API.

No individual run was what I'd ship. Each one had at least one bad call. The bakeoff's real output wasn't a winner. It was a map.

When 4 of 4 models made the same design choice, that choice was obviously right. When they diverged — on validation strictness, on data source, on UX for destructive actions — that divergence was the signal. Those were the actual design decisions, the ones worth spending judgment on.
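One way to make that map concrete is to record each run's choice per design layer and split the layers into unanimous and contested. The sketch below is an illustration only: the `RunChoices` shape, the `buildDecisionMap` name, and the sample run and layer labels are assumptions, not part of the Round 5 tooling.

```typescript
// Hypothetical shape: for one run, design layer -> the choice that run made.
type RunChoices = Record<string, string>;

interface DecisionMap {
  // Layers where every run that reported a choice picked the same one.
  unanimous: Record<string, string>;
  // Layers where runs diverged: layer -> choice -> names of runs that made it.
  contested: Record<string, Record<string, string[]>>;
}

function buildDecisionMap(runs: Record<string, RunChoices>): DecisionMap {
  // Group run names by layer, then by the choice made on that layer.
  const byLayer: Record<string, Record<string, string[]>> = {};
  for (const runName of Object.keys(runs)) {
    const choices = runs[runName];
    for (const layer of Object.keys(choices)) {
      const perChoice = (byLayer[layer] ??= {});
      (perChoice[choices[layer]] ??= []).push(runName);
    }
  }

  const map: DecisionMap = { unanimous: {}, contested: {} };
  for (const layer of Object.keys(byLayer)) {
    const distinct = Object.keys(byLayer[layer]);
    if (distinct.length === 1) {
      map.unanimous[layer] = distinct[0];
    } else {
      map.contested[layer] = byLayer[layer];
    }
  }
  return map;
}

// Illustrative input only: run and layer labels are placeholders.
const sampleMap = buildDecisionMap({
  runA: { dataSource: "contents-api", confirmDelete: "modal" },
  runB: { dataSource: "filesystem", confirmDelete: "modal" },
});
console.log(Object.keys(sampleMap.contested)); // the layers worth spending judgment on
```

A layer that only one run reported counts as unanimous here, so in practice the per-run records would need to cover the same layers for the split to mean anything.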

The Three Passes

What emerged from Round 5 is a pattern I've now run twice and would reach for again on any feature where the design space is unclear:

Pass 1 — Bakeoff. Run N models (I used 4) on the same prompt in isolated sessions. Judge blind, before you know which branch is which. Score against a rubric. The output of this pass isn't any of the N implementations — it's the decision map. You now know which choices are contested and which are obvious.

Pass 2 — Merge. Write down a merge plan before touching any code: for each contested layer, which run's approach wins and why. Then ask an agent to compose the merged best-of from those inputs. The merge is strictly better than any individual bakeoff run because it draws on information none of the bakeoff contestants had — the scored comparison of all four.

For Round 5 the plan looked like this:

Layer	Source	Why
Path validator	Opus 4.7 (Run 1)	Only run with 2-segment enforcement + `..` block + non-empty checks
Three-tier orphan match	Opus 4.6 (Run 2)	Only run that noticed exact-match missed real cases like `day-four`
Type-narrowed body parsing	Sonnet 4.6 (Run 3)	`typeof body === "object" && "path" in body`, no `as` casts
GitHub Contents API	Opus 4.6 (Run 2) / Sonnet 4.6 (Run 3)	Live state vs. build-time filesystem snapshot
Confirm-modal UX	Sonnet 4.6 (Run 3)	Best visual polish in the screenshots

Qwen 3.5 contributed nothing structural to this table. The bakeoff said "skip this one" clearly enough that there was nothing to debate. That's useful information too — knowing which pieces to skip is part of the map.
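To give a feel for what the first and third rows of the table describe, here is a minimal sketch of a path validator with those three checks, fed by a body parser that narrows types instead of casting. It is not the merged code: the function names, the `dir/name` reading of "2-segment", and the sample paths are assumptions. The explicit `body !== null` check is also my addition, since `"path" in null` would throw at runtime.

```typescript
// Returns the path unchanged when it passes all three checks, otherwise null.
function validatePath(path: string): string | null {
  const segments = path.split("/");
  if (segments.length !== 2) return null; // 2-segment enforcement
  for (const segment of segments) {
    if (segment.length === 0) return null; // non-empty check
    if (segment === "..") return null; // `..` block
  }
  return path;
}

// Narrows an unknown request body without `as` casts.
// Relies on `in` narrowing for unlisted properties (TypeScript 4.9 or later).
function parseDeleteBody(body: unknown): string | null {
  if (typeof body === "object" && body !== null && "path" in body) {
    const candidate = body.path; // typed as unknown at this point
    return typeof candidate === "string" ? validatePath(candidate) : null;
  }
  return null;
}

console.log(parseDeleteBody({ path: "notes/day-four" })); // hypothetical path
console.log(parseDeleteBody({ path: "../secrets" }));
```

Returning `null` rather than throwing keeps the caller's rejection path to a single check, which suits a destructive endpoint where every invalid input should end in the same refusal.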

The merge changed 13 files (+990/-9). One TypeScript error was caught and fixed, and the build passed on the next attempt. I opened it as a PR with the heritage table in the description so future reviewers can trace any decision back to its source run.

Pass 3 — Polish. The merged feature went live. I opened it against real production data and spotted four things immediately: truncated directory names with no tooltip, delete buttons invisible on touch devices, no bulk delete UI despite the API supporting `paths: []`, and an orphaned section header that would show with count 0 after the lone orphan was deleted.

None of those were predictable before live use. You can't predict friction from a code review — you observe it. The polish pass had to come after the merge because the artifact it was polishing didn't exist until then.

The polish pass changed 6 files (+265/-54) and took about 20 minutes of agent time.
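The empty-header item is the easiest of the four to show in code. The sketch below assumes a hypothetical `Section` shape and helper names (none of them come from the actual feature) and simply drops any section whose list is empty after a delete, so no header renders with a count of 0.

```typescript
// Hypothetical view model: a titled group of file paths.
interface Section {
  title: string;
  items: string[];
}

// Only sections that still have items get a header.
function visibleSections(sections: Section[]): Section[] {
  return sections.filter((section) => section.items.length > 0);
}

// Remove one path everywhere, then hide any section it emptied.
function removeItem(sections: Section[], path: string): Section[] {
  const updated = sections.map((section) => ({
    ...section,
    items: section.items.filter((item) => item !== path),
  }));
  return visibleSections(updated);
}
```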

When to Use It

The pattern has a real cost: the bakeoff is N full agent sessions, each producing a complete implementation that you won't ship. For Round 5 that was ~$35 in inference and a few hours of judging.

That's cheap insurance when the feature has any of these properties:

Destructive verbs. Delete, update, payment, permission change. The cost of getting validation wrong outweighs the cost of the bakeoff.
Multiple defensible architectures. Where should validation live? What's the data source? How does auth thread through? When you genuinely don't know the right answer, a bakeoff shows you the option space.
Hard to change later. Database schemas. Public API contracts. Anything that will accumulate callers.

It's overkill for a 20-line UI tweak or a feature with a single obvious implementation. The signal value of the bakeoff scales with how uncertain you are about the design.

What I'd Do Differently

Three things I'd change for the next run:

Name the contestant chats before pasting the prompt. All four Round 5 chats showed up as "New Chat" in the Coder API cost summary, which meant 20 minutes of token-volume detective work to figure out which cost belonged to which run. Five seconds of effort would have prevented that.

Capture per-phase stats. I have clean bakeoff numbers. I don't have separate merge or polish numbers — they're folded into the judging thread. A lightweight wrapper script around each phase would make the next iteration measurable end-to-end.
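A minimal sketch of what that wrapper could record, assuming a hypothetical `createPhaseLog` helper rather than a full script: call `start` when a phase begins and the returned function when it ends, optionally passing the cost read off the cost summary. The injectable clock is only there to make the helper testable.

```typescript
interface PhaseStats {
  phase: string;
  durationMs: number;
  costUsd?: number; // entered by hand from the cost summary, if known
}

type Clock = () => number;

function createPhaseLog(now: Clock = Date.now) {
  const records: PhaseStats[] = [];
  return {
    records,
    // Call when a phase begins; call the returned function when it ends.
    start(phase: string): (costUsd?: number) => PhaseStats {
      const startedAt = now();
      return (costUsd?: number) => {
        const entry: PhaseStats = { phase, durationMs: now() - startedAt };
        if (costUsd !== undefined) entry.costUsd = costUsd;
        records.push(entry);
        return entry;
      };
    },
  };
}

const phaseLog = createPhaseLog();
const endMerge = phaseLog.start("merge");
// ... run the merge session here ...
endMerge();
console.log(phaseLog.records);
```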

Write the polish friction items down before fixing them. I noticed four issues and fixed them in one pass, which collapsed the "observed" list and the "fixed" list into the same moment. Separating them, even by five minutes, would have made the "what does live-review surface" lesson sharper for the writeup. A separate list also leaves room for the occasional item that you notice but decide isn't worth fixing.

By the Numbers

3 phases: Bakeoff (4 parallel attempts), Merge (1 informed pass), Polish (1 live-review pass)
4 implementations produced in the bakeoff, 0 shipped to main as-is
3 of 4 bakeoff runs contributed at least one structural piece to the merge
13 files changed in the merge pass (+990/-9)
6 files changed in the polish pass (+265/-54)
4 friction items caught in polish that couldn't have been predicted before live use
~$35.56 inference cost for the bakeoff phase
~45 min bakeoff (parallel), ~30 min merge, ~20 min polish
