vibescoder

The Audit That Found The Thing The Audit Didn't Find


I added an MCP server to my fitness tracker last week. That meant a new authentication surface — a bearer token that unlocks the API for agents — sitting alongside the existing Google sign-in. New attack surface, new opinions about token handling, new ways to get it wrong. So I asked an agent to audit the repo.

Four hours later I had nineteen findings, all fixed, all committed, all pushed to main. I felt great.

Then I opened the app and the dashboard was empty.

This is a post about what that audit found, what I shipped, and the much more interesting things the audit didn't find — the ones that only surfaced because I fixed the things it did.


The audit

I sent the agent into the repo with one instruction: do a security review, focused on the MCP addition but covering everything that surface touches. It came back with a 26 KB markdown report. Severity-graded, file-and-line references, recommendations, the works.

The headline finding was a real one. The MCP commit had added a bearer token (MCP_API_TOKEN) that the middleware accepted on every /api/* route as a session substitute. The token itself was implemented correctly — constant-time compare with length pre-check, never logged, only accepted via the Authorization header. But the route handlers were all using a checkAuth() helper that was non-blocking by design. A leftover from working around a NextAuth v5 beta quirk: auth() returns null in Route Handlers on serverless platforms even when the user has a valid session. The original author papered over it by logging the null and proceeding anyway, trusting the middleware to be the real gate.
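For readers who haven't written one, here is a minimal sketch of what "constant-time compare with length pre-check" and "only accepted via the Authorization header" can look like. It assumes a Node.js runtime where node:crypto is available (an Edge runtime would need a different primitive), and the helper names tokensMatch and bearerFrom are hypothetical, not the app's actual code.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical helper: compares a presented bearer token against the expected one.
function tokensMatch(presented: string, expected: string): boolean {
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(expected, "utf8");
  // timingSafeEqual throws when the buffers differ in length, so check first.
  // This reveals the token length, which is usually acceptable for
  // fixed-length random tokens.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}

// Hypothetical helper: reads the token from the Authorization header only,
// never from a query string or a cookie.
function bearerFrom(headers: Headers): string | null {
  const value = headers.get("authorization");
  if (!value) return null;
  const match = /^Bearer\s+(.+)$/i.exec(value);
  return match?.[1] ?? null;
}
```

Neither helper logs its inputs, which matches the "never logged" property described above.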

That's a perfectly defensible decision if you remember why it's there, and a foot-loaded shotgun the moment you forget. One middleware bypass and nothing else stops the request from reading the database.

So that was the High-severity finding. The rest of the audit was mostly the kind of thing audits find: missing CSP and HSTS headers, no rate limiting on the MCP endpoint, an in-app /api/migrate route that did DDL with a NODE_ENV check as its only guard, plaintext credential storage despite the README claiming otherwise, verbose logging of request bodies. Plus the usual dependency advisories — the installed framework version was a major release behind the fix range for a stack of published CVEs.

I asked the agent to fix things in batches by severity. Four batches: High, Medium, Low, Info. Each batch its own commit on main, each commit followed by a green CI run.

batch-1/high   — authoritative auth gate, drop /api/migrate, bump framework version
batch-2/medium — SSRF fix, credential encryption, rate-limit, audit log, CORS, matcher, log hygiene, framework auth bump
batch-3/low    — CSP + HSTS, drop dead config, sync-warning comments
batch-4/info   — document token blast radius, add CI audit, prevent token-smuggling regressions

checkAuth() got promoted to requireAuth(request), returning a 401 NextResponse when neither a session nor a valid bearer was present. Every API handler grew an early-return on the NextResponse result. The MCP self-fetch SSRF surface got eliminated by extracting the Peloton and Tonal sync logic into shared library functions that both the REST routes and the MCP tools call directly. Credential columns got envelope-encrypted with AES-256-GCM via a new CREDENTIAL_ENC_KEY. The /api/migrate route got deleted entirely. Rate limiting and audit logging landed. CSP, HSTS, all the headers. Framework bumped one major to clear the published advisories.
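The credential encryption is the easiest of these to show in isolation. What follows is a rough sketch of the AES-256-GCM step only, using a single key directly rather than a full envelope scheme with wrapped data keys. It assumes the key is 32 random bytes; the function names and the iv.tag.ciphertext storage format are hypothetical, not what the repo actually stores.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumption: `key` is 32 bytes, e.g. decoded from a base64 CREDENTIAL_ENC_KEY.
function encryptCredential(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, fresh for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // authentication tag produced by GCM
  // Hypothetical storage format: iv.tag.ciphertext, each part base64-encoded.
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

function decryptCredential(stored: string, key: Buffer): string {
  const parts = stored.split(".");
  if (parts.length !== 3) throw new Error("malformed credential blob");
  const [iv, tag, ciphertext] = parts.map((part) => Buffer.from(part, "base64")) as [
    Buffer,
    Buffer,
    Buffer,
  ];
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the tag does not verify (wrong key or tampered data).
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

The point of GCM over a plain cipher mode is the tag check: a wrong key or a modified column value fails loudly instead of decrypting to garbage.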

Sixteen or so files changed across the four commits. Zero build failures. CI passing. I pushed the last commit, Vercel redeployed, I went to look at the app.

The dashboard was empty.

The first thing that wasn't in the audit

The auth fix had done exactly what it was supposed to do: requireAuth() now correctly returned 401 when the route handler couldn't read a session. The problem was that the v5 beta bug it was working around hadn't gone away. So now every API request from my logged-in browser session was 401-ing, because auth() was still returning null in route handlers, and the new requireAuth() had nothing to fall back to.

The audit had flagged the v5 beta version as a finding (F-08) and recommended bumping to the latest beta. I did. The newest beta still had the bug. The original checkAuth() had been a workaround. By fixing the workaround, I had unfixed the workaround's reason for existing.

The fix was to make requireAuth() smarter: when auth() returns null, fall back to getToken() from next-auth/jwt, which reads the session cookie directly from the request and decrypts the JWE with a key derived from NEXTAUTH_SECRET. Same cryptographic check the middleware does. If it verifies, we synthesize a minimal Session shape from the decoded JWT so handlers reading session.user.email still work. If it doesn't verify, then we 401. Not "cookie present means trust": that would have walked us right back into F-01.
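The control flow is easier to see in code than in prose. This is a dependency-free sketch, not the real handler: the three injected functions are hypothetical stand-ins for the bearer check, auth(), and getToken(), the "mcp-agent" identity is a placeholder, and a standard Response takes the place of NextResponse.

```typescript
// Minimal session shape that handlers rely on (session.user.email).
type MinimalSession = { user: { email: string } };

// Hypothetical injected dependencies, standing in for the real helpers.
type AuthDeps = {
  isValidBearer: (req: Request) => boolean;
  getSession: (req: Request) => Promise<MinimalSession | null>;
  // Must resolve to claims only when the cookie cryptographically validates.
  verifySessionCookie: (req: Request) => Promise<{ email?: string } | null>;
};

async function requireAuth(req: Request, deps: AuthDeps): Promise<MinimalSession | Response> {
  // Placeholder identity for bearer callers; the real app may model this differently.
  if (deps.isValidBearer(req)) return { user: { email: "mcp-agent" } };

  const session = await deps.getSession(req);
  if (session) return session;

  // Fallback for the case where the session helper returns null in a route handler.
  const claims = await deps.verifySessionCookie(req);
  if (claims?.email) return { user: { email: claims.email } };

  // A cookie that is present but does not validate is treated like no cookie at all.
  return new Response(JSON.stringify({ error: "unauthorized" }), {
    status: 401,
    headers: { "content-type": "application/json" },
  });
}
```

A handler would call it first and bail out early: if the result is an instance of Response, return it; otherwise use it as the session.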

fix(auth): restore browser data — verify session JWT directly in requireAuth

I pushed, Vercel redeployed, I refreshed the page.

The dashboard was still empty.

The second thing that wasn't in the audit

This is the one that stopped me in my tracks, and it's the reason I'm writing this post.

I asked the user — me, on my phone — to hit the home page in a fresh tab. The page loaded, no redirect. No "please log in." The dashboard chrome rendered with empty charts.

That's impossible. If you're not signed in, middleware is supposed to redirect you to /login. Empty data on a rendered dashboard with no session is a state the app shouldn't be able to reach.

I added a debug endpoint to dump what the server saw. Cookies, env vars, auth state. The browser session had two cookies: __Host-authjs.csrf-token and __Secure-authjs.callback-url. Both artifacts of a partial sign-in flow. The actual __Secure-authjs.session-token — the cookie that proves you signed in — was missing.

So I wasn't signed in. But the dashboard was loading. Therefore the middleware wasn't redirecting. Therefore the middleware wasn't running.

I checked the build output. The Next.js build prints a route table at the end. Every route I'd expect, but no ƒ Middleware XX kB line, which Next.js shows when middleware is compiled in. I dumped .next/server/middleware-manifest.json:

{ "middleware": {}, "sortedMiddleware": [] }

Empty. No middleware compiled into the build. None. The file at the project root, the one I'd been editing in two of the four audit batches, was being silently ignored by Next.js.
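A check like this could run in CI right after the build. It is only a sketch: it assumes the manifest keeps the two fields shown above, and since the manifest is an internal build artifact its shape may differ between Next.js versions. The function name is hypothetical.

```typescript
// Only the two fields visible in the manifest above; anything else is ignored.
type MiddlewareManifest = {
  middleware?: Record<string, unknown>;
  sortedMiddleware?: string[];
};

// Returns true when the build output lists at least one compiled middleware entry.
function middlewareIsCompiled(manifestJson: string): boolean {
  const manifest = JSON.parse(manifestJson) as MiddlewareManifest;
  const entries = Object.keys(manifest.middleware ?? {});
  const sorted = manifest.sortedMiddleware ?? [];
  return entries.length > 0 || sorted.length > 0;
}
```

In a pipeline, the idea would be to read .next/server/middleware-manifest.json after the build, pass its contents to this function, and fail the job when it returns false for a project that is supposed to have middleware.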

I checked out the pre-audit commit. Built that. Same result — empty manifest. So this wasn't a regression from the security work. The middleware had been a no-op since the Next.js 14→15 upgrade, possibly longer.

The reason, once I worked it out, is one of those things that is obvious in retrospect and invisible until you trip on it: when a Next.js project uses a src/ directory layout (which this one does; all code lives under src/app/), middleware.ts must live at src/middleware.ts, not at the project root. The root-level file is silently ignored. No warning. No build error. The matcher config is parsed and discarded. The bundle is built without it.

I moved the file:

git mv middleware.ts src/middleware.ts

Updated the relative import from ./auth to ../auth. Built. The route table now showed:

ƒ Middleware    87.2 kB

The audit had not found this. It had reviewed the middleware file, recommended hardening the matcher pattern (F-03), and added a sync-warning comment about the timing-safe compare (F-11). None of those changes did anything on the running site, because the file they applied to was being thrown away every build. The security implication was significantly worse than any High in the audit report: every page route in the application had been served with no server-side auth gate at all for an unknown amount of time. The API was gated only by the route-handler checkAuth() (which, per the audit's own F-01 finding, was non-blocking). Anyone who guessed the URL could fetch the prerendered dashboard HTML.

The audit couldn't see this because it was reading source files, not build artifacts. The agent did exactly what it was asked to do. The thing it was asked to do didn't include "verify that the files you're reviewing are actually being deployed."

I pushed the move. Redeployed.

fix(middleware): move middleware.ts into src/ so Next.js actually loads it

The dashboard now redirected unauthenticated callers to /login, as it should have been doing all along. I clicked Sign in with Google.

Google said Access blocked: Authorization Error. The OAuth client was not found. Error 401: invalid_client.

The third thing that wasn't in the audit

When middleware started actually running, it started actually requiring authentication. Which meant I had to actually sign in. Which meant the Google OAuth flow had to actually work. Which it didn't.

I traced the redirect URL the app was sending to Google:

https://accounts.google.com/o/oauth2/v2/auth?response_type=code&client_id=undefined&...

client_id=undefined. The literal string. process.env.GOOGLE_CLIENT_ID was unset on Vercel.

My first theory was an Edge-runtime bundling issue. The new src/middleware.ts imports auth from ../auth, which pulls the entire NextAuth config — including the Google provider — into the Edge bundle. The Google provider reads GOOGLE_CLIENT_ID at module load time. Edge functions run in their own isolate. If the env var wasn't readable when the Edge isolate cold-started, the provider would cache clientId: undefined for the lifetime of that instance.

So I rewrote src/middleware.ts to not import from ../auth at all. Same session-cookie verification using getToken() from next-auth/jwt directly, which reads NEXTAUTH_SECRET at call time, not at module load. The middleware bundle dropped from 87.2 kB to 45.7 kB — the Google provider was gone. Ship it.

client_id=undefined.

OK, so that wasn't it. Next step: a debug endpoint that reports, for each environment variable I cared about, only whether it is set and how long it is, never the value. Bearer-gated so I can call it from my workspace without exposing it publicly. The output:

{
  "NEXTAUTH_SECRET":      "SET (len=44)",
  "GOOGLE_CLIENT_ID":     "UNSET",
  "GOOGLE_CLIENT_SECRET": "UNSET",
  "ALLOWED_EMAIL":        "SET (len=19)",
  "MCP_API_TOKEN":        "SET (len=64)",
  "CREDENTIAL_ENC_KEY":   "UNSET",
  "APP_BASE_URL":         "UNSET",
  ...
}
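The core of an endpoint like that is tiny. Here is a sketch of the reporting function that would produce output in the shape above. The name envPresence is hypothetical, and the bearer gate around it is omitted.

```typescript
// Reports whether each named variable is set, plus its length, without
// ever including the value itself. Empty strings are reported as UNSET.
function envPresence(
  names: string[],
  env: Record<string, string | undefined>,
): Record<string, string> {
  const report: Record<string, string> = {};
  for (const name of names) {
    const value = env[name];
    report[name] = value ? `SET (len=${value.length})` : "UNSET";
  }
  return report;
}
```

Passing the variable names explicitly, rather than iterating over everything in process.env, keeps the response from leaking the names of variables nobody meant to expose.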

GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET were not set on Vercel. Period. The fitness tracker had been deployed to Vercel for a year, and right now there were no Google OAuth credentials backing it.

That meant one of two things. Either the env vars had been there and got purged (unlikely — no recent change to env var configuration), or they had never been there, and Google sign-in had never worked.

I asked. Confirmed: never set up. There was no Google Cloud project. There was no OAuth client. There was no GOOGLE_CLIENT_ID to set.

This is where my model of the situation broke. The app's auth.ts had referenced process.env.GOOGLE_CLIENT_ID! since commit 73fdaab ("refactor: archive mobile app, add Google OAuth, remove sensitive data"), which predates all of the audit work. There was no plausible path where Google sign-in had been working before and stopped working now. So how had I been getting into the app for the past year?

The answer, of course, is that I hadn't been. Middleware wasn't running. The page route was being served as static HTML, prerendered at build time, to anyone who asked. The "session" displayed on the dashboard was a fiction generated entirely client-side from API responses that had no auth on them either (because checkAuth() was non-blocking). I had been using an unauthenticated copy of my own app, with the database open to the public internet via API routes that thought they had auth.

This is the part where I want to be careful about how I talk about it, because it makes me sound careless and it makes the agent sound impressive, and neither framing is quite right. The careless one is me. I shipped the app without testing the auth flow against a real Google OAuth client. The agent did exactly what it was hired to do: read the code, identify weaknesses, suggest fixes. It correctly identified that the auth surface was the most important part of the app and prioritized it. It did not — could not — know that the auth surface had never been wired up to anything real in production, because the code looked correct, the env-var references were syntactically present, and the README claimed the feature worked.

The audit found patches for nineteen specific defects. The audit did not find that the entire auth feature had been broken at the infrastructure layer for the life of the deployment. To find that, you have to do something the audit wasn't asked to do: try to use the app like a real user, on the real production deployment, with no shortcuts.

The walk-through to fix it was the standard Google Cloud OAuth setup — new project, configure consent screen, create Web application credentials, paste the redirect URI exactly (https://your-app.vercel.app/api/auth/callback/google, no trailing slash), copy the Client ID and Secret to Vercel env vars, redeploy. About ten minutes once we knew that's what we were doing. Sign-in worked on the first try.

chore: remove temporary debug endpoints

Lessons

This whole thing was supposed to be a clean security audit. Instead it was a clean security audit followed by three larger problems that the audit couldn't have caught, each one only visible because the previous one had been fixed. The order matters, because each fix peeled back a layer of compensating behavior that was hiding the next problem.

Audits find what they're asked to look at. This sounds tautological but it's the whole post. The agent did the work I asked for, well. The pattern that nearly bit me was that the work I asked for had implicit assumptions — that the files I was asking it to review were actually deployed, that the env vars referenced in the code were actually set, that the auth flow it was reviewing was actually used — and those assumptions were all wrong. Static code review is necessary and insufficient. Dynamic verification on the running app is the other half, and I didn't ask for it.
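One cheap piece of dynamic verification is refusing to start when a referenced env var is missing. TypeScript's non-null assertion (the ! in process.env.GOOGLE_CLIENT_ID!) only silences the compiler and does nothing at runtime. The sketch below shows a fail-fast check; the function names are hypothetical and where to call it (build step, server startup, health check) is left open.

```typescript
// Returns the names of required variables that are missing or empty.
function missingEnv(required: string[], env: Record<string, string | undefined>): string[] {
  return required.filter((name) => !env[name]);
}

// Throws with a readable message instead of letting `undefined` flow into
// things like an OAuth redirect URL.
function assertEnv(required: string[], env: Record<string, string | undefined>): void {
  const missing = missingEnv(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

Called with something like assertEnv(["GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET"], process.env), the failure shows up as an error naming the variable rather than as client_id=undefined in a URL.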

Compensating behavior hides bugs. checkAuth() was non-blocking because of the v5 beta auth() bug, but the comment said "middleware already verified auth" — and that wasn't actually true, because middleware was a silent no-op. Two bugs that, together, looked like working software. Either one alone would have produced a 401 or a redirect and gotten my attention. Together, they produced an app that loaded with empty data and made me feel like everything was fine. The audit fixed checkAuth(). That fix immediately exposed the middleware-not-running bug. Which immediately exposed the OAuth-never-configured bug. There was no shortcut to the bottom of the stack — each layer had to be peeled in order.

The framework will not tell you when it's ignoring you. Next.js silently discards a project-root middleware.ts when src/ is present. No warning. No build error. The route table doesn't mention it. The middleware manifest is just empty. There is probably a documentation page that says this; I have not gone looking. The point is that the failure mode is invisible from the project's perspective, and the way I caught it was by inspecting a build artifact (.next/server/middleware-manifest.json) that I had no prior reason to look at. If your framework supports both src/ and non-src/ layouts and the same config files work differently in each, at least one of those layouts is going to silently swallow somebody's config and that somebody is probably going to be a vibe coder who doesn't know which layout they're using.

Test the production deployment with a real account, before declaring victory. I ran this app in production for a year and never noticed it didn't have auth. Not because I was lazy (I'd reviewed the code, I'd seen the signIn callback, I'd seen the ALLOWED_EMAIL guard) but because I'd been opening the app from my own browser, every day, and it had been showing me my data, every day. The behavior I expected matched the behavior I observed. The behavior I expected was a complete fiction. The only test that would have caught it is the dumb one: sign out, open an incognito tab, hit the URL, see what happens. I never did, because the cost of being wrong felt low, until it suddenly didn't.
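The dumb test can be automated. This is a sketch under stated assumptions: the page gate answers with a 3xx redirect whose Location mentions /login, the API answers 401, and the API path you pass in is one of your own routes. The fetch function is injected so the check can run against a stub or a real deployment. All names here are hypothetical.

```typescript
// A fetch-like function; the response only needs a status and a header lookup.
type FetchLike = (
  url: string,
  init?: { redirect?: "manual" },
) => Promise<{ status: number; headers: { get(name: string): string | null } }>;

type SmokeResult = { pageRedirectsToLogin: boolean; apiRejects: boolean };

// Requests one page and one API route with no cookies and no Authorization header.
async function unauthenticatedSmokeTest(
  baseUrl: string,
  apiPath: string,
  fetchLike: FetchLike,
): Promise<SmokeResult> {
  // redirect: "manual" so we observe the redirect itself instead of following it.
  const page = await fetchLike(`${baseUrl}/`, { redirect: "manual" });
  const location = page.headers.get("location") ?? "";
  const api = await fetchLike(`${baseUrl}${apiPath}`, { redirect: "manual" });
  return {
    pageRedirectsToLogin: page.status >= 300 && page.status < 400 && location.includes("/login"),
    apiRejects: api.status === 401,
  };
}
```

Run against production after every deploy, a result with either field false would have flagged this app's state a year earlier: a page that renders and an API that answers, both without a session.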

Cleanup matters. I shipped six "debug" commits getting to the bottom of these problems: endpoints that dump env state, middleware headers that report what the matcher decided, query parameters that short-circuit the handler with a JSON response. Every one of them was correctly bearer-gated or explicitly scoped to be safe-while-public, and every one of them got removed in 413e282 once we knew what was happening. But "remove the debug endpoint when you're done" is a footnote you forget at your own peril, and the session transcript has at least two moments where the agent had to be reminded to clean up. Worth noting that the cleanup commit happened after the user said "we're good", not in the middle of the debugging, when it would have been forgotten.

By the Numbers

  • Audit findings: 19 (2 High, 8 Medium, 4 Low, 5 Info)
  • Audit batches shipped: 4
  • Audit-batch commits: 4 (one per severity)
  • Total commits over the session: 18, including six debug-and-revert pairs
  • Post-audit critical bugs discovered: 3 (v5 beta auth() returning null in handlers, middleware file in wrong location, Google OAuth never configured)
  • Files that were silently being ignored by the framework: 1 (middleware.ts at project root)
  • Months that file had been a no-op: unknown, ≥ 3
  • Days the audit report had been blessing the no-op file: same number
  • Build manifest line that revealed it: "middleware": {} in .next/server/middleware-manifest.json
  • Time from audit-handoff to "everything passing again": ~4 hours
  • Of which time spent on the audit batches: ~1 hour
  • Of which time spent on the three things the audit didn't find: ~3 hours
  • OAuth clients created in Google Cloud Console during this session: 1
  • Vercel env vars deleted as orphaned afterwards: ~15
  • Bytes of debug endpoints removed in the cleanup commit: a few thousand
  • Trust in the agent: unchanged — it did its job well
  • Trust in my own assumptions: significantly reduced
