Written with Claude.
The Swiss army knife you have to hold a certain way
I run a field data collection platform for building assessments. The app has AI agents that look up equipment costs, write condition narratives, validate data, and generate reports. Those agents need to find cost records in a database, do deterministic math (no LLM arithmetic on dollar amounts), and write results back to a JSON file.
In December 2025, I asked Claude to build me a cost lookup tool. Claude said I needed an MCP server with vector embeddings for similarity search. MCP (Model Context Protocol) was the new way to give AI agents tools: you run a small server, the agent discovers what tools are available, calls them with structured inputs, gets structured outputs back. Vector search was the retrieval mechanism: take a sentence like “3000-ton field-erected cooling tower, 2005 vintage,” run it through an embedding model that converts it to 384 numbers, then find other sentences in the database whose 384 numbers point in a similar direction. Cosine similarity. The smaller the angle between two vectors, the more semantically similar the sentences.
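The math underneath is small. Here is cosine similarity in full, as a generic sketch rather than the server's actual code:

// Cosine similarity: the dot product of two vectors divided by the
// product of their magnitudes. 1.0 means same direction (identical
// meaning, per the embedding), 0.0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}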
We dug deep into this. The idea of semantic similarity was genuinely compelling. Show me everything in the database that sounds like a knockdown cooling tower from the early 2000s, even if nobody used those exact words. Not keyword matching. Meaning matching. We designed confidence tiers (HIGH at 40%+ similarity, MEDIUM at 25-40%). We tested end-to-end with 70%+ similarity scores on HVAC and plumbing queries. Claude built a brute-force cosine similarity search over the entire database on every query. It shipped as five tools. It worked.
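Brute force means exactly what it sounds like: embed the query, score every record, sort, keep the top five. Roughly this shape, reusing cosineSimilarity from above (the record fields are illustrative, not our actual schema):

// O(n) per query: score every record against the query embedding.
// No index, no approximate-nearest-neighbor structure. At this
// scale, none of that would earn its keep either.
type CostRecord = { description: string; embedding: number[] };

function findSimilarCosts(
  queryEmbedding: number[],
  records: CostRecord[],
  k = 5,
) {
  return records
    .map((r) => {
      const similarity = cosineSimilarity(queryEmbedding, r.embedding);
      // The confidence tiers we designed: HIGH at 40%+, MEDIUM at 25-40%.
      const confidence =
        similarity >= 0.4 ? 'HIGH' : similarity >= 0.25 ? 'MEDIUM' : 'LOW';
      return { description: r.description, similarity, confidence };
    })
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}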
On New Year’s Day 2026, Claude added a “Why MCP?” section to the README. It wrote its own architectural justification for the thing it had just built me. This is the actual text from commit 107497b2, authored by Claude Opus 4.5:
## Why MCP? (Architectural Decision)
### Deterministic vs Probabilistic Boundary
"What equipment needs costing?" → Claude (judgment)
"Find similar costs in database" → MCP (math, reproducible)
"Is 72% similarity good enough?" → Claude (judgment)
"Return top 5 matches" → MCP (exact k-NN)
### The Compounding Effect
- Every completed PCA adds vetted costs to the database
- Vector search finds them by meaning, not keywords
- After 100 projects, 90%+ matches HIGH confidence
- WebSearch becomes rare
Claude's context window resets every conversation.
The vector database persists forever.
MCP bridges that gap.
Four months later: 6 projects. 150 records.
I bought it completely. The reasoning was sound. The architecture was clean. The table made sense. Neither of us asked how many projects it would take to get to 100, or how big the database would actually get, or whether 150 records needed vector search at all.
Then, over the next few months, more functions got bolted on. Cost calculation. JSON file writes. Condition narrative merging. Template matching. Equipment storage. Database stats. Five tools became eleven, all in one file. Each one added because the server was already there and it was easier to add a case to the switch statement than to question whether the switch statement should exist.
The result was an 802-line monolith. Here’s what the core looks like, collapsed:
server.setRequestHandler(CallToolRequestSchema,
  async (request) => {
    // every tool enters here: pull the tool name and raw args off the request
    const { name, arguments: args } = request.params;
    const a = (args ?? {}) as Record<string, unknown>;
    switch (name) {
case 'find_similar_costs': // 30 lines
case 'find_similar_equipment': // 18 lines
case 'find_similar_conditions': // 17 lines
case 'match_cost_template': // 10 lines
case 'store_equipment': // 16 lines
case 'get_database_stats': // 14 lines
case 'calculate_cost': // 34 lines
case 'merge_cost_data': // 39 lines
case 'merge_condition_narrative': // 33 lines
case 'browse_cost_templates': // 24 lines
case 'get_cost_by_id': // 31 lines
case 'apply_cost_from_db': // 101 lines
default: throw new Error(`Unknown tool`);
}
});
Twelve cases in one switch. Every handler starts the same way:
const query = a.query as string;
const limit = (a.limit as number) ?? 5;
const category = a.category as string | undefined;
That as string is not validation. It’s a promise to the compiler that the value is a string, with no check that it actually is. If an agent sends a number where a string is expected, TypeScript shrugs and the handler crashes at runtime. This is how bug #784 happened: the calculator rejected valid numeric inputs because nothing checked types before the handler saw them.
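The fix is to parse instead of assert. A sketch with Zod, matching the three fields above (illustrative, not the spec's actual schema):

import { z } from 'zod';

// A cast tells the compiler to trust the value. A schema checks it
// at runtime and fails loudly, before the handler runs.
const inputSchema = z.object({
  query: z.string(),
  limit: z.number().int().positive().default(5),
  category: z.string().optional(),
});

// a is the raw args object from the switch handler above. If an
// agent sends { query: 42 }, .parse throws a ZodError that names
// the field and the expected type instead of crashing mid-handler.
const input = inputSchema.parse(a);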
The whole thing requires a persistent process running an embedding model to do similarity search over 150 records.
It still worked. You just had to hold it a certain way.
What would good look like?
We had two open GitHub issues filed on different dates. Issue #986: “consolidate MCP cost tools, reduce 10-tool selection overhead.” Issue #968: “deprecate MCP server entirely, replace with direct file writes.” One said improve it. One said remove it.
I couldn’t decide, so we tried something. Instead of picking a side, we’d spec the best possible version first. Not a sketch. A real spec, detailed enough to hand to a plan-and-implement workflow. If the spec proved the thing was worth keeping, we’d have the blueprint ready to build. If not, the spec would still exist as the documented answer to “why not?” for anyone who asks later.
We cloned the Svelte team’s MCP server (sveltejs/ai-tools) and read every file. Their file tree:
handlers/tools/
get-documentation.ts # handler + registration
svelte-autofixer.ts # handler + registration
svelte-autofixer.test.ts # test, right next to it
playground-link.ts # handler + registration
playground-link.test.ts # test, right next to it
list-sections.ts # handler + registration
handlers.ts # re-exports for CLI use
index.ts # re-exports registrations
Each tool is a file. Each file has a schema, a handler, and a registration function. A tool looks like this:
// schema: validates before handler sees it
const schema = v.object({
code: v.string(),
desired_svelte_version: v.pipe(
v.union([v.string(), v.number()]),
v.description('The desired svelte version...'),
),
});
// handler: pure function, testable alone
export async function svelte_autofixer_handler({
code, ...
}) {
// 90 lines of logic
}
// registration: wires handler to server
export function svelte_autofixer(server) {
server.tool({
name: 'svelte-autofixer',
schema,
outputSchema: autofixer_output_schema,
}, async (input) => {
return await svelte_autofixer_handler(input);
});
}
Our file tree:
meap-mcp-server.ts # everything
We spec’d what ours would look like rebuilt to the Svelte standard. Separated handlers from registration. Added Zod schemas. Structured output. Collapsed eleven tools to four. The spec would have fixed real bugs: the calculator rejecting valid inputs (no schema catching type mismatches), the primary write tool having zero test coverage (can’t test a case inside a switch block).
Claude wrote it up as a full design doc. File structure, dependency changes, migration path, which issues it would close. Ready to plan and implement.
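Concretely, a rebuilt tool file would have looked something like this. A sketch assuming zod and the MCP TypeScript SDK's McpServer API; the names are hypothetical, the pattern is the point:

// calculate-cost.ts — one tool per file: schema, handler, registration.
import { z } from 'zod';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';

// Schema: validated before the handler sees anything.
export const calculate_cost_schema = {
  rate: z.coerce.number().positive(),
  quantity: z.coerce.number().positive(),
};

// Handler: a pure function, testable with no server, transport, or mock.
export function calculate_cost_handler(input: { rate: number; quantity: number }) {
  return { total: input.rate * input.quantity };
}

// Registration: the only part that knows MCP exists.
export function calculate_cost(server: McpServer) {
  server.tool('calculate-cost', calculate_cost_schema, async (input) => ({
    content: [
      { type: 'text' as const, text: JSON.stringify(calculate_cost_handler(input)) },
    ],
  }));
}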
Does even the ideal version earn its keep?
We had a spec we were genuinely ready to build. Zod schemas, handler separation, structured output, four clean tools. It would have fixed real bugs. I was mentally planning the implementation order.
Then I asked a different question. Not “how do we build this?” but “what does it look like if we just don’t have it?”
We decomposed each tool into whatever standalone thing it actually is. The cost calculator is a pure function with no state. It doesn’t need a server. It doesn’t need MCP. It’s bun cost-calc.ts --rate 100 --quantity 10, returning JSON on stdout. Same deterministic math, same checksum. The JSON file writes? jq handles reads and patches. Python scripts handle transforms. Both already existed in the project. The merger functions were already standalone modules in lib/. The MCP server was just calling them from inside a switch block. Validation? Three Python scripts already existed (validate-building.py, validate-template-ids.py, validate-wordbank.py). Agents could call them via Bash. They just… didn’t, because everything went through the MCP server instead.
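For scale, here's roughly what cost-calc.ts amounts to, with flag names from the command above (the checksum and the real cost model are elided):

// cost-calc.ts — run as: bun cost-calc.ts --rate 100 --quantity 10
import { parseArgs } from 'node:util';

const { values } = parseArgs({
  options: {
    rate: { type: 'string' },
    quantity: { type: 'string' },
  },
});

const rate = Number(values.rate);
const quantity = Number(values.quantity);
if (!Number.isFinite(rate) || !Number.isFinite(quantity)) {
  console.error('usage: bun cost-calc.ts --rate <n> --quantity <n>');
  process.exit(1);
}

// Deterministic math, JSON on stdout. No server, no protocol, no state.
console.log(JSON.stringify({ rate, quantity, total: rate * quantity }));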
We kept going down the list. Every tool decomposed into something simpler that already existed or could be a one-file CLI script. Ten of eleven tools didn’t need a server, didn’t need MCP, didn’t need a protocol layer. They needed to be callable.
The only thing that genuinely needed a persistent process was the vector similarity search. Loading the embedding model into memory takes 30 seconds. You don’t want to cold-start that on every query. That was the whole reason the MCP server existed as a running process in the first place.
So: does the vector search earn its keep?
150 records
We looked up what Andrej Karpathy has been saying about vector search and RAG. In April 2026, he published a GitHub gist describing an alternative: structured markdown files with an LLM-maintained index. His approach, he wrote, “avoids the need for embedding-based RAG infrastructure.” It works for datasets up to about 400,000 words, roughly 100 substantial source documents.
My dataset: 150 vetted cost records. 200 equipment comparables. Maybe 50,000 words total. Two orders of magnitude below Karpathy’s threshold. The entire cost database would fit in a context window. An agent could grep it.
We’d been running an embedding model as a persistent process, loading vectors into a SQLite database, querying via the MCP protocol over stdio transport, all to search a dataset that grep could handle.
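At this size, retrieval is a filter. The entire replacement for the embedding pipeline looks something like this (file path and record shape are hypothetical):

import { readFileSync } from 'node:fs';

// The whole database is a few hundred KB of JSON. Reading it on
// every query costs less than one embedding-model cold start.
type CostEntry = { description: string; unitCost: number };

const records: CostEntry[] = JSON.parse(
  readFileSync('data/cost-records.json', 'utf8'),
);

// Substring match over 150 records. Not semantic, but at this scale
// an agent can afford to try a few phrasings.
const hits = records.filter((r) =>
  r.description.toLowerCase().includes('cooling tower'),
);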
The zeitgeist problem
Here’s the part I keep thinking about. When I asked Claude “what should the ideal MCP look like,” its first instinct was to expand the server. Add wordbank lookup tools. Add JSON query tools. Add validation tools. Go from eleven tools to fifteen. The AI wanted to make the Swiss army knife bigger.
Late 2025 was peak “put everything behind MCP tools” energy, and that’s exactly when we built this. Claude’s training data from that period is full of MCP tutorials, architecture diagrams with MCP in the middle, blog posts about building your first MCP server. When you ask “how should I structure this,” the AI reaches for what it saw most during training. Not what fits your problem.
I had to explicitly break the frame. “I was looking at removing it entirely. Tell me why we even need an MCP to do any of these things.”
“How do we improve this” produces better versions of the existing thing. “Why does this need to exist” is a different question. Once you ask it, you notice that most of what the server did was wrap file operations in a protocol layer that added complexity without adding capability.
The barnacles
The server itself is 802 lines. Replaceable in an afternoon. But everything that grew around it is harder.
The cost agent’s prompt is 800 lines long. Forty of those lines are a step-by-step MCP tool call sequence: “call browse_cost_templates with query = equipment description, the tool returns candidate records WITHOUT dollar amounts, call apply_cost_from_db with the selected record_id, the server handles all DB lookup, inflation normalization, and calculation atomically.” That whole section has to be rewritten to say “run this CLI tool” instead. The project’s CLAUDE.md has three separate gotchas about MCP behavior: test database paths, server restarts polluting stdout, tool name collisions across specs. Agent changelogs track MCP-specific bug fixes going back to January.
While writing this post, another session was deploying an unrelated server update. The deploy process touched the MCP server’s vector database binary, which silently grew by 8KB. The review gate caught it: “MCP indexing touched the binary DB. Not a code change, restore it and proceed.” The session had to git checkout the file back to clean the working tree before continuing. A deploy for code that has nothing to do with the vector database, interrupted by the vector database.
If someone had asked “do we need this?” in February, the teardown would have been trivial. The longer vibe-coded infrastructure runs without being questioned, the more downstream code wraps around it. Each piece is small. Together they’re a real migration.
The spec still sits in dev/superpowers/specs/. If the dataset grows to 10,000 records and grep stops working, the blueprint is there. If someone asks why we don’t use an MCP server, we hand them the spec and the decomposition. You never lose the work. You just use the spec as a mirror instead of a blueprint.