Written with Claude.

[Header image: a wood-and-brass hand plane labeled ‘single firm’ on the left; a networked, LCD-and-antenna power tool labeled ‘fleet’ on the right]

Writing the Counterfactual Made the Answer Obvious

The problem we were solving is simple to state: every piece of equipment in an engineering assessment report needs a defensible replacement cost. Where did that number come from? How confident are we in it? Could a client or a peer reviewer trace it back to a source? Answering those questions for hundreds of items across dozens of projects is the whole job. The system we built to do it had one sentence at its core: put the correct number into a file safely.

“Correct” is the hard half — sourced from the right place, rated for confidence, traceable to a specific vendor quote or database record or web search. “Safely” is the easy half — written atomically so a crash mid-write doesn’t corrupt the file. We used MCP (Model Context Protocol, a standard for giving AI agents access to external tools via a server process) to build that system, exposing twelve tools to a cost agent so it could look up records, run calculations, and write results back to the project file.
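
Here is roughly what the “safely” half means in code, since it is the part that survives every architecture in this post. A minimal sketch, assuming a Bun or Node runtime; the function name is illustrative, not the project’s actual code:

    import { renameSync, writeFileSync } from "node:fs";

    // Write the new contents to a temporary file next to the target, then
    // rename it into place. rename() on the same filesystem is atomic, so a
    // crash mid-write leaves either the old project file or the new one,
    // never a half-written mix.
    function atomicWrite(path: string, contents: string): void {
      const tmp = `${path}.tmp-${process.pid}`;
      writeFileSync(tmp, contents, "utf8");
      renameSync(tmp, path);
    }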

Four months later we tore it down.

I was adding a fleet-platform counterfactual to the retrospective on Sunday when the thing I had been trying to articulate for a week finally came into focus. The question had been nagging since the teardown: was the MCP server wrong in principle, or just wrong at our scale? I kept drafting one version of the answer, and Claude kept producing a more polished version of the same draft, and neither of us was landing it.

What finally worked was asking Claude to explore the alternate reality directly: not “explain why the MCP server failed” but “describe the conditions under which this exact architecture would be the right call, in as much detail as you can.” That prompt produced a fleet platform spec with thirty tools, persistent vector indices, multi-tenant ACL enforcement, and operating costs in the thousands per month. Reading that spec against the actual system made the scale mismatch obvious in a way that no amount of direct analysis had.

This post is about that second-order question, and the title names the finding directly: a fleet engineering platform and a single-firm tool can use the exact same protocol family (same JSON-RPC transport, same tool-call shape, same MCP SDK), and building it can be the right call for one and the wrong call for the other. The verdict is not about the protocol; it is about the scale context the protocol is being asked to serve. The Spec-to-Kill Method covers the original verdict in detail. That post ends with the teardown decision. This one starts with the question the teardown raised: under what conditions would we have been right to build it?

The answer has eight signals. Each one flips at scale.

What We Built, and What It Needed to Be

The MEAP MCP server ran from 2025-12-22 to 2026-04-19 as a per-project daemon exposing twelve tools to the cost agent. Here is what we actually shipped:

Per-client-project process tree

├── Claude Code session (one per project)
│   └── stdio JSON-RPC transport
│       └── meap-mcp-server.ts (802 lines)
│           ├──> @modelcontextprotocol/sdk
│           ├──> @huggingface/transformers  (~50MB warm)
│           ├──> sqlite-vec                 (vector index)
│           ├──> equipmentVectorDB.ts       (915 lines)
│           ├──> indexers (3 files)
│           ├──> cost-calculator.ts
│           ├──> building-merger.ts
│           └──> ppi-staleness.ts
│
└── Caller-side burden
    ├── Tool selection across 4 overlapping search tools
    ├── "Process groups sequentially. Concurrent MCP
    │      writes cause data loss." (in every skill)
    ├── Number-as-string handling in JSON-RPC (#784)
    └── Uncalibrated HIGH/MEDIUM/LOW confidence labels

Here is what it needed to be:

Per-call invocation

├── bun cost-calc.ts    (Zod-validated math, 71 lines)
├── bun cost-write.ts   (parallel-safe write, 24 lines)
├── bun cost-lookup.ts  (record search, 107 lines)
└── grep -i {term} *.yaml  (181 records, ~10,000 words)
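
For scale, here is the shape of one of those CLIs. This is a hedged sketch of cost-calc, not the shipped 71-line script; the field names are illustrative:

    #!/usr/bin/env bun
    // Sketch only: take one JSON argument, validate it with Zod, print the
    // result, and exit non-zero on bad input so the calling agent sees a
    // failure instead of a wrong number.
    import { z } from "zod";

    const Input = z.object({
      baseCost: z.number().positive(),         // sourced figure, in dollars
      quantity: z.number().int().positive(),
      regionFactor: z.number().default(1),     // location adjustment
      escalationFactor: z.number().default(1), // PPI escalation to the report date
    });

    const parsed = Input.safeParse(JSON.parse(process.argv[2] ?? "{}"));
    if (!parsed.success) {
      console.error(parsed.error.message);
      process.exit(1);
    }

    const { baseCost, quantity, regionFactor, escalationFactor } = parsed.data;
    console.log(JSON.stringify({
      replacementCost: Math.round(baseCost * quantity * regionFactor * escalationFactor),
    }));

The point is not the arithmetic. The point is that the validation, the calculation, and the exit code fit in a couple dozen lines with no daemon behind them.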

The complexity bill came down to this: 2,900 lines of server code to 202 lines of CLIs, four runtime dependencies to two, one persistent process per active project to zero — a reduction documented in detail in the project retrospective written across the teardown sessions.

The reduction looks like a verdict. It is only half the story.

What Changes at Fleet Scale

The other half required writing out the universe where none of those simplifications hold.

At single-firm scale, the cost agent is asking: what did we pay to replace a comparable boiler in Oakland last year? It is asking that question against 181 records in local YAML files, one project at a time, with one engineer reviewing the output. The server that serves it is a per-project daemon on a developer’s workstation. If it crashes, you restart it. If it’s slow, you wait. There is no other tenant, no concurrent agent, no audit authority beyond the engineer reading the report.

The fleet platform is a different question entirely. Thirty-plus engineers from multiple firms, 200 active projects in flight, cost records shared across the firm’s archive and live vendor feeds, a supervisor coordinating a dozen concurrent sub-agents that cannot be allowed to write conflicting costs to the same building at the same time. The tool that answers “what did we pay for a comparable boiler?” in that context needs warm state across calls (re-indexing 50,000 records on every call is not an option), streaming output (vendor feeds are live), coordination primitives (two agents writing the same field is a data integrity problem, not a nuisance), and ACL enforcement at the gateway (Firm A cannot see Firm B’s negotiated rates). Those are fundamentally different problems from “grep a YAML file and format the result,” and they require a server the way that a distributed transaction log requires a server — not as an architectural preference, but because the problem has no stateless solution.
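
To make “warm state across calls” concrete, here is the shape of a fleet-side tool using the MCP TypeScript SDK’s registration style. Everything specific in it is illustrative: the tool name, the record shape, and the in-memory index stand in for the persistent services the fleet spec actually describes:

    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    // Illustrative record shape. A real fleet index would be a vector store
    // over 50,000+ records plus live vendor feeds, not an in-memory array.
    type CostRecord = { item: string; cost: number; firmId: string };
    const index: CostRecord[] = []; // built once at startup, reused by every call

    const server = new McpServer({ name: "fleet-cost", version: "0.1.0" });

    server.tool(
      "comparable_cost",
      { query: z.string(), firmId: z.string() },
      async ({ query, firmId }) => {
        // The two things a per-call CLI cannot give you: state that outlives
        // the call, and tenancy filtering enforced here rather than in a prompt.
        const hits = index
          .filter((r) => r.firmId === firmId)
          .filter((r) => r.item.toLowerCase().includes(query.toLowerCase()))
          .slice(0, 5);
        return { content: [{ type: "text", text: JSON.stringify(hits) }] };
      }
    );

    await server.connect(new StdioServerTransport());

At fleet scale the index is the expensive thing, so the process that holds it warm earns its keep. At 181 records there is nothing worth keeping warm, and the same registration ceremony is pure overhead.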

Here is the comparison that made the answer obvious:

Dimension            | Single-firm tool                    | Fleet platform
Engineers            | 1 to a handful                      | 30+ across an MEP firm or several firms
Active projects      | 5 to 20                             | 200+ in flight at any time
Historical projects  | Local Box mount                     | 1,500+ across the firm's archive
Cost-record corpus   | 181 vetted records (~10,000 words)  | 50,000+ records plus live vendor feeds
Equipment instances  | A few hundred ever                  | 2,000,000+ across the project fleet
Concurrent agents    | 1 cost agent at a time              | Dozens plus a supervisor
Tenancy              | One firm, one user                  | Multiple firms, ACL-filtered
Audit retention      | Box file versioning                 | Seven years, queryable, replay-safe
Operating cost       | $0/month plus LLM API               | Thousands/month: cloud infra, vendor APIs

The fleet platform needs a gateway, three persistent intelligence services, an audit firehose, and thirty tools, not twelve. Each of those thirty tools is stateful, streaming, or coordinative: it maintains data across calls, pushes updates as they arrive, or enforces ordering between concurrent operations. None of them are stateless file operations dressed up as RPC, which is exactly what the twelve single-firm tools were. The single-firm design has zero of those fleet requirements today.

Same protocol family. Different verdicts. The question is which side of that table you are on.

Eight Signals That Tell You Which Side You Are On

Writing the retrospective and the counterfactual together surfaced eight patterns. Here they are compressed to their diagnostic core, with the practical consequence of each:

1. Server abstraction without state
   Single-firm signal: 11 of 12 tools are stateless file I/O.
   Fleet signal: most tools hold warm state across calls.
   So what? If your tools are stateless, the server is overhead: you are paying for a process, a protocol, and a dependency graph to wrap functions that could be CLI calls.

2. Overlapping tool surface
   Single-firm signal: 4 search tools answering the same question.
   Fleet signal: tool groups cover distinct problem surfaces.
   So what? Overlap means agents waste context selecting among equivalent options. Fleet tools earn their count by covering distinct problems.

3. Concurrency in prose, not code
   Single-firm signal: skills say “process sequentially.”
   Fleet signal: coordination is code-enforced.
   So what? Prose warnings break silently. If coordination matters at the scale you are operating at, it belongs in the code, not in the prompt (see the sketch after this list).

4. Uncalibrated confidence labels
   Single-firm signal: HIGH/MEDIUM/LOW from cosine distance, unvalidated.
   Fleet signal: labels calibrated against domain outcomes.
   So what? Labels that aren’t grounded in outcomes are decorations. At fleet scale, miscalibrated confidence propagates to hundreds of reports. At single-firm scale, the engineer catches it.

5. Hide-from-LLM vs prove-to-auditor
   Single-firm signal: “cannot see” when you need “cannot forge.”
   Fleet signal: cryptographic receipts; visibility separate from trust.
   So what? Ask what the actual requirement is. If it is auditability, receipts achieve that without a daemon. If it is genuinely preventing the model from seeing a value, a server earns its cost.

6. Embeddings below threshold
   Single-firm signal: ~10,000-word corpus, against a Karpathy 2026 floor of roughly 400,000 words.
   Fleet signal: 50,000+ records; embeddings earn their weight.
   So what? Below the threshold, grep on a text file beats a vector index: no embedding model to maintain, no warm daemon, no 50MB startup.

7. Spec-to-kill before spec-to-build
   Single-firm signal: “Should we consolidate?” loops without a spec.
   Fleet signal: write the best version first, then decide.
   So what? Writing the spec for the tool you might build forces you to name what it would need to do. Often the list is short and a CLI covers it.

8. Specialization vs duplication
   Single-firm signal: two mergers sharing shape, different treatment.
   Fleet signal: share shape, specialize treatment; ask which before consolidating.
   So what? When two implementations look similar, the reflex is to consolidate. Ask whether they share a trust model, not just a shape. Different trust, different code, even if the traversal is shared.
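
Pattern 3 is the easiest one to act on immediately, so here is what putting coordination “in the code” can look like even at single-firm scale. A hedged sketch assuming a Bun runtime and a lock-file convention; it is not the shipped cost-write.ts, which the retro describes only as a 24-line parallel-safe write:

    import { closeSync, openSync, unlinkSync } from "node:fs";

    // Take an exclusive lock file before touching the project file, so a second
    // concurrent writer waits instead of silently interleaving with the first.
    // The "wx" flag makes creation fail if the lock already exists.
    function withProjectLock<T>(projectFile: string, fn: () => T): T {
      const lockPath = `${projectFile}.lock`;
      let fd: number | null = null;
      while (fd === null) {
        try {
          fd = openSync(lockPath, "wx");
        } catch {
          Bun.sleepSync(50); // another writer holds the lock; retry shortly
        }
      }
      try {
        return fn();
      } finally {
        closeSync(fd);
        unlinkSync(lockPath);
      }
    }

This buys the same guarantee the skills tried to buy with “process groups sequentially,” except it holds whether or not every agent reads the warning.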

Pattern 1 is the load-bearing one for us. Eleven of twelve tools were stateless file operations on YAML and JSON files. The twelfth held a vector index over a corpus of roughly 10,000 words, well below the threshold where embeddings add value over grep. Everything else on the list above is downstream of that mismatch.

Pattern 5 deserves its own paragraph because we got it wrong explicitly, not just structurally. An early architectural decision formalized the server’s purpose as hiding dollar values from the cost agent. The actual requirement was something narrower: every cost in every engineering report needed to be reproducible from artifacts we could point to. “Cannot be forged” and “cannot be seen” are different properties. The server delivered both, at the cost of a server. The CLI replacement delivers “cannot be forged” through cryptographic receipts while letting the agent see the values. The teardown preserved every load-bearing property from that original intent; it just dropped the mechanism that required running a daemon.
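
What a receipt looks like in practice, as a hedged sketch; the payload fields and the key handling are illustrative rather than the actual cost-write.ts mechanism:

    import { createHmac } from "node:crypto";

    // The signing key lives outside the agent's context (an env var here,
    // illustratively), so the agent can read and reason about every dollar
    // value but cannot produce a valid receipt for a value the write CLI
    // never wrote.
    const key = process.env.COST_RECEIPT_KEY ?? "";

    function receiptFor(projectId: string, field: string, value: number, source: string) {
      const payload = `${projectId}:${field}:${value}:${source}`;
      return { payload, hmac: createHmac("sha256", key).update(payload).digest("hex") };
    }

    // A reviewer, or a later audit script, recomputes the HMAC from the payload;
    // a mismatch means the recorded value is not the one the CLI wrote.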

The quote from the retrospective that I keep coming back to:

“when you hear ‘hide X from the LLM,’ ask whether the actual requirement is ‘cannot be forged’ rather than ‘cannot be seen.’ Those are different properties with different solutions.”

Pattern 6 is the one I expect to apply most often, outside MEAP. Karpathy’s April 2026 analysis put the practical floor for embedding-based retrieval at roughly 400,000 words of corpus — below that, grep on sorted text returns comparable recall with no model overhead, no warm daemon, and no startup cost. That number is a rule of thumb drawn from retrieval benchmarks, not a precise threshold, but it is a useful first filter before reaching for the vector stack. The MEAP cost database was 181 records totaling roughly 10,000 words. We had built for a recall problem we did not have.

The One Pattern That Took the Most Pressure to Surface

Pattern 8 did not come from the MCP design itself. It emerged mid-retrospective, when the first draft noted that two building-merger.ts implementations existed in the repo and flagged them for consolidation. Claude drafted the consolidation argument. I pushed back. The two mergers share a traversal shape but handle different trust models: engineer-authored edits versus agent-authored edits are different problems, and each implementation is simpler as a result of being specialized.
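
The distinction is easier to see in code than in prose. A sketch of “share the primitive, keep the specialized callers,” with names that are illustrative rather than lifted from the repo’s two building-merger.ts files:

    // One traversal, two policies. The recursion is shared; the trust decision is not.
    type Policy = (path: string, ours: unknown, theirs: unknown) => unknown;

    function mergeBuilding(
      base: Record<string, any>,
      edits: Record<string, any>,
      resolve: Policy,
      path = "",
    ): Record<string, any> {
      const out: Record<string, any> = { ...base };
      for (const [key, theirs] of Object.entries(edits)) {
        const p = path ? `${path}.${key}` : key;
        const ours = base[key];
        if (ours && theirs && typeof ours === "object" && typeof theirs === "object"
            && !Array.isArray(ours) && !Array.isArray(theirs)) {
          out[key] = mergeBuilding(ours, theirs, resolve, p); // shared shape
        } else {
          out[key] = resolve(p, ours, theirs);                // specialized treatment
        }
      }
      return out;
    }

    // Engineer-authored edits: trust the human, their value wins.
    const mergeEngineerEdits = (base: any, edits: any) =>
      mergeBuilding(base, edits, (_path, _ours, theirs) => theirs);

    // Agent-authored edits: only fields the agent is allowed to touch get
    // through (the "costs." rule is illustrative).
    const mergeAgentEdits = (base: any, edits: any) =>
      mergeBuilding(base, edits, (path, ours, theirs) =>
        path.startsWith("costs.") ? theirs : ours);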

The retrospective records this directly. The agent’s self-reflection produced the wrong consolidation reflex. The pressure test produced the right answer: same shape, different treatment, share the primitive, keep the specialized callers. The pattern is in Part 5 of the retro because it genuinely generalizes, but it only made it into the retro because of a specific exchange where the first framing was wrong.

That is a useful data point about where the actual work sits in AI-assisted development.

The Architecture That Would Have Been Right

The fleet platform spec is not a committed direction. It is an exploratory counterfactual, filed alongside the single-firm design to keep the chosen direction a deliberate decision rather than a default. The point of writing it was exactly what it delivered: a clear picture of the conditions under which the MCP architecture earns its cost.

Thirty stateful tools. Persistent vector indices over 50,000-record corpora. Cross-process coordination with explicit claim resolution. Multi-tenant ACL enforcement at the gateway. Seven-year queryable audit retention. Operating costs in the thousands per month rather than zero.

None of those conditions exist today. All of them are plausible within five years if the business grows in the directions it is currently heading. When any of the five “when to reconsider” conditions become true — corpus crossing 50,000 records, second firm wanting in, concurrent agents exceeding single-process coordination, live client portal as a deliverable, vendor API integration becoming core — the fleet doc is the spec to revisit.

Until then, it is the universe we are not in.

The “Speccing the Successor” post, currently queued, will cover what the next-gen single-firm design actually looks like. This post was the middle step: not the kill verdict, not the successor design, but the diagnostic for why the same protocol can be four months of liability in one context and the only correct answer in another. The eight patterns are the diagnostic. The scale table is where you read off which column you are in.