Closing the Loop: From Audit to Ten Commits in Four Hours

I asked three AI agents to audit this blog — two Opus variants and a local Qwen 3.5 — in three separate Coder Agents chat sessions, with the same prompt. A few hours later, three reports landed in my inbox. They overlapped on some findings, disagreed on others, and each caught at least one thing the others missed. A combined 90+ findings, of which maybe 15 were actionable, of which exactly one was a go fix it right now emergency.

This post is about what happened next — the remediation arc from "audit in inbox" to "audit closed," the structure of the plan, and the things the process surfaced that I didn't expect.

The arc: triage three reports into one verified plan, then ship the fixes in phases. Four phases, ten commits, two repos, four hours from the audit hitting my inbox to "audit closed." No build failures. One scheduled-publish leak averted, several dependency CVEs closed, one verified injection vector neutralized, one timing side-channel sealed, two pages converted from dynamic to static rendering. And a small pile of items honestly deferred for later.

A note up front: the patterns of the fixes are described below, but specific exploit recipes, exact identifiers, and the pre-fix code that contained the bugs are not. The whole point of a remediation post is to teach the technique without handing the next attacker a starter kit. The audits' redaction rules apply to writeups about the audit, too — and as the callout near the end of this post explains, I almost forgot that.

The first move was not fixing anything

The three reports didn't agree. One of the Opus runs flagged an RSS feed XSS via CDATA breakout; the other missed it entirely. Qwen said dependencies were clean; both Opus variants said there were CVEs to close. Qwen graded the codebase a "B+." The Opus reports didn't grade anything; they ranked findings P0-P3.

If I'd started shipping fixes for one report's findings without cross-checking, I'd have wasted time on phantom bugs and missed real ones. So the first hour was triage:

Clone both repos myself. The agents had each cloned them separately during their audits. I needed my own fresh copy so I could verify findings against current main of each.
Look at every claimed critical finding firsthand. Read the source. Confirm the bug exists. Decide whether the framing is right.
Reject what doesn't survive verification. Several of one report's "critical" findings were really one bug counted multiple times. Deduplicating dropped the combined report from 90+ findings to ~15 actionable items.
Promote what only one model caught. The RSS CDATA breakout that only one Opus variant flagged turned out to be real and easy to fix. The login timing leak that only Qwen wrote up clearly turned out to have the cleanest remediation prescription. Both went into the plan despite each being a single-source finding.

The output of triage was a unified phased plan, written to disk before any code changed. Phases were ordered by:

Real-world blast radius (data leak > brute-force > injection-with-trust-boundary > perf)
Time-to-deploy (content fixes ship in minutes; engine fixes ship via Vercel after merge)
Dependencies between fixes (the rate-limit library needs to exist before the login endpoint can use it; CSP must be set before tightening inline scripts)

Phase 0 — Today, 30 min        Stop the bleeding
Phase 1 — This week, ~1 day    4 verified critical engine fixes
Phase 2 — Next week, ~1 day    6 hardening items, batched
Phase 3 — Week after, ~1-2 day 5 perf wins
Phase 4 — Open-ended           Polish

That was the plan at noon. By 4 PM, all five phases had shipped.

Phase 0: the scheduled-publish leak

One published draft was scheduled to auto-publish the following morning at 5 AM PT. It contained four leaked identifiers — exactly the four patterns the blog's redaction rules cover: an OAuth Client ID, a hosted tunnel subdomain, a Linux home path, and a Linux username. The original fodder file had all four redacted. Somewhere in the draft-to-post pipeline, the redactions regressed.

Two-step fix:

Redact the post. One commit, four replacements, push to main. The scheduled-publish Action picks up the redacted version when publishAt fires.
Rotate the leaked secret. A Client ID alone is a public identifier, but the full credential pair had been read end-to-end by three AI agent sessions in the past few hours. Rotating the secret invalidates anything that may have leaked.

The rotation flow itself was uneventful — generate new secret, paste into the appropriate config file, restart the service, delete the old secret. I hit one error on the first try: I'd updated the config but not restarted the service. The service reads env vars at process start; the new secret was on disk but the old one was still in memory. One restart and the reconnect went through clean.

That single error during rotation is the post inside the post. Phase 0 was the only phase that had a hands-on operational dependency — every other fix shipped through CI without touching infrastructure. Rotation is the one place where "AI agent ships the fix" meets "human owns the deploy target" and both have to coordinate.

Phase 1: the four critical engine fixes

Four independently-shippable PRs. I batched the dependency bump first because everything else builds on it.

1.1 — Dependency CVEs

Several published advisories matched the pinned versions of next and @anthropic-ai/sdk. Worth noting: a couple of them were middleware-bypass CVEs, which are directly relevant when admin authorization lives in middleware. A bug that lets requests slip past route matchers is exactly the kind of issue that turns a hardened admin endpoint into a leaky one.

The fix was 15 minutes: install the patched versions, typecheck, commit, push.

One remaining audit hit — a transitive postcss advisory — chains back through a bundled copy. npm audit fix --force proposed downgrading the parent framework to a major version from years ago to "resolve" it. That's worse than the bug. Documented the false positive and moved on. Always look at what audit fix --force actually proposes before running it.

1.2 — The RSS CDATA breakout

The RSS feed was dropping raw post bodies into a <![CDATA[...]]> block. If any post body contains the literal sequence that terminates CDATA, the section ends early and the remaining content renders as malformed XML. Result: broken feed for every subscriber the first time a post discusses CDATA, regex examples, or shell heredocs.

The fix is the standard CDATA-splitting trick: replace the closing sequence with two adjacent CDATA sections so the XML parser concatenates them transparently. Six lines including the comment.

Caught by only one of the two Opus variants in the original audit. A genuine reminder that "same model family" is not the same as "interchangeable for security review."

This was the heaviest single fix and the one I deferred to its own session for that reason. Three independent issues on the login endpoint, none of them showstoppers individually, real defense-in-depth together.

Class of bug 1: no rate limit. A login endpoint without per-IP throttling is, in theory, an unbounded online brute-force surface. In practice the password is strong enough that brute force is infeasible regardless, but the right answer is to make the math infeasible by two compounding factors, not one. Fix: per-IP fixed-window limiter backed by Upstash Redis. The limiter is factored into a shared lib so other endpoints can reuse it.

Class of bug 2: no Origin check. The session cookie was already SameSite=strict, so a cross-site form post couldn't actually use it, but rejecting unauthorized origins at the server is cheaper and louder than relying on browser policy alone. Fix: parse the Origin header, compare against Host, reject mismatches.

Class of bug 3: timing side-channel in password comparison. This one is worth a sentence on the pattern, because it's a class of bug that shows up in a lot of homegrown auth code.

The naive shape of a "safe" password check is:

if input.length != expected.length: return false
return constant_time_equal(input, expected)

The problem is the early return false. A constant-time compare takes the same time regardless of inputs, but the early-return path is much faster than running the compare. An attacker who can measure response timing can therefore distinguish "wrong length" from "right length, wrong content," which leaks the password's length — several bits of entropy gone for free.

The fix is to make the comparison run over fixed-size inputs regardless of input length. The common pattern is to hash both sides with a fast cryptographic hash and compare the digests:

a = sha256(input)
b = sha256(expected)
return constant_time_equal(a, b)

Both digests are exactly 32 bytes. The comparison takes the same time whether the input is empty, the right length, or 10,000 characters long. No early return, no timing leak. (For password storage specifically, use a real password-hashing function like Argon2 or bcrypt; for comparing two known-trusted strings in a hot path, plain SHA-256 of both sides is fine.)

One UX detail: the login page now surfaces the rate-limit response with a "try again in N minutes" message instead of the generic "invalid password." Otherwise the lockout would be indistinguishable from a typo, and the user would keep retrying.

1.4 — The GitHub Actions injection class

This was the only finding from the audits that an unauthenticated internet user could reach without going through a login boundary.

The pattern in question: a GitHub Actions workflow that interpolates user-controllable event payload fields directly into a run: script. GitHub Actions evaluates the ${{ ... }} expression syntax before the script reaches bash. If any of the interpolated fields is attacker-controllable — and a discussion title is — the contents become literal shell. Repository secrets in scope of that job become reachable.

The standard fix is straightforward: route every event-payload field through the env: block, then reference it in the script as a normal environment variable. Bash sees the value as an opaque string. The pre-evaluation step doesn't get to inject code.

There was a second bug in the same workflow: a Slack JSON payload was hand-rolled with shell string interpolation, which would have broken (or been injectable) on any title containing a quote character. Fixed by building the payload with jq --arg, which JSON-escapes every interpolated value.

Verified the fix against a representative attack-shaped payload before shipping. Output: properly escaped JSON. No shell impact, no JSON break.

Phase 2: the hardening batch

Six low-risk independent items, batched into one commit because reverting any one wouldn't affect the others.

CSP (Report-Only) + HSTS. The Content Security Policy shipped in Report-Only mode first. That header surfaces violations in the browser console without breaking the site. After ~a week of clean reports, flip the header name to the enforcing variant. HSTS got a year-long max-age with includeSubDomains and preload.

Analytics rate limit and path allowlist. The page-view tracking endpoint sanitized the path string but had no upper bound on unique paths. Every unique path minted a new Redis key, so an unbounded number of unique requests could fill the KV store with arbitrary keys. Fixed by (1) per-IP rate limit reusing the lib from 1.3, (2) regex allowlist matching the legitimate route patterns. Rejections return 200 silently to avoid leaking the failure shape to a probe.

JSON-LD </script> escape. A one-line defensive fix. JSON.stringify doesn't escape < by default. A frontmatter field containing a script-end sequence would otherwise terminate the script tag and inject HTML. Author-controlled today, defense-in-depth tomorrow.

Loom embed origin validation. The component iframed whatever URL was passed. Now it parses the URL, requires HTTPS, and requires the hostname to match the expected video host. Invalid input renders nothing.

Lazy env reads for module-level secrets. One handler read SLACK_SIGNING_SECRET and a GitHub token at module load, freezing them across warm invocations. A platform-level env-var rotation wouldn't take effect until the next cold start. Moved the reads into the call site so rotation is immediate.

One more content redaction. A different published post had an internal LAN IP. Replaced with an RFC 5737 documentation address. Lower-stakes than the Phase 0 redactions, but the redaction rules apply.

Six items, one commit. CSP violations to be checked over the next week.

Phase 3: the static-rendering reclamation

The single biggest performance win in this whole audit had nothing to do with bundle size or caching. It was a cookie read on every post page.

The post page was checking for an admin session in the server component, so it could conditionally render admin controls. That check forces dynamic rendering. Next.js can't pre-render a page that reads cookies because cookies are per-request. generateStaticParams was being defeated by cookies(). Every reader was paying for the admin-controls feature, none of them were admin, and TTFB was 200-600ms on a cold edge instead of <50ms.

The fix is a client island:

New auth-probe endpoint that returns 200 or 401, never caches.
New hook that calls the endpoint on mount and returns a boolean.
New wrapper components that render the real admin controls only when the hook returns true.
Delete the cookie read from the post page and the homepage. Always render the island slot.

The post page no longer reads cookies. The homepage no longer reads cookies. Both go back to static rendering. The admin controls "pop in" ~50ms after page load for the one user who's authenticated; for everyone else (which is everyone), they're never rendered, never fetched, never paid for.

Three other Phase 3 wins, all shipped in the same commit:

Animation library off the fade-in helper. A fade-in-on-scroll component was importing a full animation library (~35KB gzip) on every page in the site. CSS keyframes do the same animation in zero JavaScript:

@keyframes animate-in-up {
  from { opacity: 0; transform: translateY(20px); }
  to   { opacity: 1; transform: translateY(0); }
}
.animate-in {
  animation: animate-in-up 500ms cubic-bezier(0.21, 0.47, 0.32, 0.98) forwards;
  animation-delay: var(--animate-in-delay, 0s);
}
@media (prefers-reduced-motion: reduce) {
  .animate-in { opacity: 1; animation: none; transform: none; }
}

Per-call-site delay is plumbed through a CSS custom property. A few components still need the animation library for stateful enter/exit transitions; everywhere else, it's gone.

Lazy-load the search modal. The header was static-importing the search modal, which static-imports the fuzzy-search library and the JSON search index. Now lazy via next/dynamic, plus the parent conditionally mounts so the chunk doesn't fetch until the user opens search. Readers who never open search pay zero bytes for the search runtime.

React.cache on filesystem reads. Wrapped the post-listing functions in React.cache. Within a single render pass, the homepage and the sitemap now share one filesystem read instead of two. Trivial change, real saving at 30+ posts.

Phase 4: honest scope

The original plan had five Phase 4 items. Two shipped. Three were honestly deferred. One was closed as no-longer-relevant.

Shipped: dedupe a utility function. Three near-identical date-formatting helpers across admin components, each with subtly different behavior. One of them had a latent bug — it always appended T00:00:00, which breaks on full ISO datetimes — exactly the bug documented in Friday Fixes: Mobile First and the Skill That Saved Us. Consolidated to one shared util. The latent bug got fixed for free.

Shipped: relocated the Qwen audit artifacts. During its session, Qwen committed four documents into the drafts folder. They're not blog drafts; they're audit artifacts. Moved them into a docs subfolder with a README explaining the provenance. Preserved as historical record without polluting the drafts directory.

Deferred: CSP nonce. The original plan was to tighten the script-src directive by replacing inline-script allowances with a per-request nonce. But I'd already promised a week of CSP-Report-Only observation before enforcing — and it had been less than a day. Tightening the policy before seeing the violation reports is exactly the kind of premature optimization that breaks production. Deferred until the observation window is real.

Deferred: build-time markdown rendering. Currently syntax highlighting runs at request time during MDX rendering. Pre-rendering at build would save the cost on every request. But this is a real refactor of the MDX pipeline, not a polish item.

Deferred: optimized images for markdown content. Plain markdown image syntax doesn't provide width and height, so the renderer falls through to a non-optimized image element for most images. Fixing it properly means probing image dimensions at build time and injecting them into the AST. Real engineering, not polish.

Closed: trim the search index payload. Flagged in the original audit, and true at the time. The Phase 3 lazy-load of the search modal means the search index is now only fetched when a user opens search. The "ships full content to every reader" framing no longer applies. Closed as mooted.

I deliberately under-shipped Phase 4. Forcing the three deferred items into today's batch would have meant shipping refactors before they were ready, tightening a CSP I hadn't observed yet, or pretending that "more commits" was the same as "more done."

The mistake I almost made writing this post

Once the audit was closed and the commits were in, I started drafting this post. The first draft included:

All four exact identifiers from the Phase 0 redaction, listed in a tidy bulleted block as "here's what was in the leaked draft."
A literal copy-paste of the exact LAN IP that Phase 2.6 had just redacted.
The exact pre-fix source of the vulnerable password comparison.
A working shell-injection payload, complete with the GitHub Actions context that makes it run.
Ten commit SHAs in a table, each one a clickable diff that points readers straight at the pre-fix state of the bug.
The exact admin password length, in passing, as part of a "this is computationally infeasible" calculation.

Every one of those is the same class of mistake that Phase 0 existed to prevent. The audit's whole point was that descriptions of an attack are not the same as identifiers for the target — and there I was, in the writeup celebrating that distinction, conflating them again.

It took a separate review pass to catch. Same pattern as the audits themselves: verifying against the rules before publishing matters more than getting the post out fast. The redaction rules in the blog's skill file are not a Phase 0 thing. They're a "any time text leaves this workspace" thing. That includes posts about not leaking things.

The published version of this post describes patterns, not targets. Specific identifiers are generalized. Working exploits are described as classes of bugs, not as recipes. Commit SHAs are absent. The math demonstrating that the rate-limit math is fine doesn't disclose a parameter that helps anyone.

[The agent writing this post would like it noted that the agent writing the first draft was, in fact, the same agent. It cheerfully redrafted four of the same leaks it had just fixed, plus added a working shell-injection payload as a bonus. A human review pass caught it. Lessons were learned. Skill files were consulted. The redaction rules now apply to remediation writeups too, in writing, so the next agent can't claim it didn't know.]

What this process actually looked like

Five things stand out, in roughly the order they happened:

Triage was the highest-leverage hour. Reading three reports against ground truth before writing a single line of code is the difference between "shipped what the auditors said" and "shipped what's actually broken." Several of one report's critical findings were the same bug counted multiple times. One report's "cost-amplification DoS" framing depended on a brute-force success that the existing middleware would prevent. The "B+" grade was framing, not a finding. Verifying each high-stakes claim against the codebase took about an hour and changed the plan materially.

Phasing is the second-highest-leverage hour. A plan that says "fix all the criticals first" sounds rigorous but isn't actionable. Real phasing accounts for what depends on what: the rate limiter needs to exist before the login fix can use it; CSP needs a baseline policy before nonces are useful; perf changes shouldn't ship before the auth surface they touch is hardened. The four-phase structure wasn't aesthetic — it was the dependency graph.

Batching versus atomic commits is a real tradeoff. Phases 0 and 1.3 got their own focused commits because reverting them would matter. Phase 2 got one batched commit with six items because they're independent and reverting one wouldn't affect the others. Phase 3 also batched, for the same reason. I asked the user explicitly before batching Phase 1, which had four items of varying risk — they chose to defer 1.3 to its own session, which was the right call.

Verification beats reasoning. Every claim from every audit got tested. The login timing fix got a smoke-test against several input shapes. The shell-injection fix got an attack-shaped payload run through the new pipeline. The analytics path allowlist got a dozen test cases including path traversal and oversized inputs. None of this took long; all of it caught at least one mistake I would have shipped otherwise.

Deferring is shipping. Three Phase 4 items got deferred with specific reasons (observation window not yet complete; real refactor not polish; needs build-time image probing). Writing those reasons down is the work that turns a deferral into a coherent next step. The remediation plan now has a tail — items that will outlive this session — and that's a feature, not a failure to complete.

The headline number — ten commits, four hours, zero build failures — is real, but it's not the point. The point is the order: triage before phasing, phasing before commits, verification before shipping, redaction before publishing.

By the Numbers

3 audit reports synthesized into 1 phased remediation plan
15 actionable items identified after deduplication (down from a combined 90+ across the three reports)
5 phases in the plan, all shipped in 1 working session
10 commits across 2 repos, 0 build failures
4 hours from the audit hitting my inbox to "audit closed"
Multiple dependency CVEs closed across two direct dependencies
1 scheduled-publish leak averted before going live
3 verified injection classes neutralized (RSS XSS, Actions shell, JSON-LD)
1 timing side-channel sealed with a hash-both-sides pattern
2 pages moved from dynamic to static rendering
~35 KB gzip removed from the initial bundle of most pages by replacing an animation library with CSS keyframes
3 Phase 4 items honestly deferred with reasons, not silently dropped
1 OAuth secret rotated, with one operational gotcha (need to restart the service for the new env to take effect)
1 draft of this post rewritten after a self-audit caught the same class of leak the post was about. Skill file updated.