Written with Claude.
The direction I wasn’t using
Last week I wrote about the spec-to-kill method. Short version: I had an MCP server I’d built six months earlier, cargo-culted into eleven tools, genuinely useful but Rube Goldberg. I couldn’t decide whether to clean it up or rip it out. So Claude and I specced the best possible version (handler separation, schema validation, four clean tools), and then I asked whether even the ideal version earned its keep. The answer was no. grep handled the dataset. The spec became the teardown blueprint.
The method has a direction baked in: spec what exists in its ideal form, then decide whether to keep it.
This week I was on the other side of the problem. I have a whole platform (a thing I’ve been building for eighteen months), and I don’t know whether it should stay what it is.
The Frankenstein
The platform is a field data collection tool for property condition assessments. Engineer walks a property with a tablet, captures equipment data, AI agents write cost estimates and condition narratives, reviewers approve in dashboards, the system spits out a report. It works. I use it on real projects for paying clients.
It’s also a Frankenstein. Eight hundred sessions of pair-programming with Claude (me directing, Claude drafting and implementing, both of us iterating), most of them while I was simultaneously learning the Claude Code ecosystem itself. The runtime runs inside an interactive LLM IDE session. Cross-machine sync goes over Syncthing. State lives as files in directories. Prompts are embedded in skill-file frontmatter. Twenty of the twenty-four documented gotchas in the project’s README trace to substrate choices that work for one user and would not generalize.
This post was written the same way. Cyborg mode, to borrow the term from Ethan Mollick: me and Claude weaving through the same work. “I” below is usually me making a call inside the pair; “we” is both of us; Claude gets a name when a specific move is worth pointing out.
The question had been sitting there for months: is this a power tool for one person that should stay a power tool for one person? Or is it the germ of something durable that a second engineer could read on a Monday and be contributing to by Friday?
I didn’t know. Arguing it abstractly was going in circles. So I ran the spec-to-kill move, but forward: spec the durable version. The one with a headless runtime, a database as system of record, schema-enforced writes, multi-tenancy, and observable AI runs. The version that passes the bus-factor test. If the spec is genuinely better than what exists, I can decide whether to build it. If it’s not, the spec is the documented argument for why the Frankenstein is fine.
The first draft
We started writing. Three files in a dev/next-gen/ folder: a README, a design doc, a build doc. Draft after draft.
The first draft was riddled with references to the prototype. It used the prototype’s vocabulary: building.json, wordbank, discipline, subsection, review dashboard, cost_scope. It cited the prototype’s ADRs (architecture decision records) by number. It compared to /a-cost, the current command that runs the cost agent. It explained design decisions as reactions to the prototype’s failure modes.
I handed a mental copy to the hypothetical second engineer the README said it was written for. Paragraph one mentioned “meap2-it” (the prototype’s working-title repo name) and “building.json.” Neither was defined. The stranger gave up before reaching the architecture section.
The design thinking was right. The writing was wrong. We were cargo-culting my own prototype’s vocabulary into the spec of its replacement.
“Don’t even mention meap”
I pushed back. Full standalone. Don’t mention the prototype at all. Use the industry standard. This is useful for any building system, not just mechanical-electrical-plumbing. The platform should be described on its own terms.
The rewrite rule was exact: zero references to the prototype. No ADR numbers. No command names. No internal jargon. The domain had to come from somewhere a cold reader could look up independently.
That somewhere was ASTM E2018.
The standard does the work
ASTM E2018 is the industry standard for Property Condition Assessments. It defines the vocabulary: Immediate Repairs for items needing work within twelve months. Replacement Reserves for capital items within the Evaluation Term (typically ten to twelve years). Opinion of Probable Cost for each item. Property Condition Report as the deliverable. The User (the party receiving the report: buyer, lender, owner, investor), as distinct from the assessing firm’s staff.
My prototype had its own words for all of these. Most engineering firms have their own internal words. The standard has one set that every professional in the industry has agreed to, or at least can look up.
The design doc’s opening section became the ASTM E2018 primer. A reader who has never touched my prototype gets the domain in paragraph one:
A Property Condition Assessment (PCA) is a formal survey of a property’s physical condition. Performed to ASTM E2018, a baseline PCA results in a Property Condition Report (PCR) delivered to the User that documents existing physical deficiencies requiring Immediate Repair (within ~12 months), capital items requiring Replacement Reserves within the Evaluation Term (typically 10-12 years), and an Opinion of Probable Cost for each deficiency and reserve item.
Every architectural section downstream refers back to that vocabulary. The domain model uses it. The workflow table uses it. The cost agent’s structured output uses category: immediate | reserve | excluded and year_of_action in the evaluation term. The report renderer generates a PCR against the ASTM E2018 section outline.
Maybe fifty lines of inline definitions compressed into a two-word citation. I did not even do that work. The standard did it.
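For concreteness, here is a minimal sketch of what the cost agent’s structured output could look like under that vocabulary, assuming a Python runtime with pydantic for the schema-enforced writes. category and year_of_action come from the spec as quoted above; the other field names are illustrative, not from the documents.

from typing import Literal, Optional
from pydantic import BaseModel

class CostOpinion(BaseModel):
    # ASTM E2018 bucket for the item: Immediate Repair, Replacement Reserve,
    # or excluded from the cost tables entirely.
    category: Literal["immediate", "reserve", "excluded"]
    # Year within the Evaluation Term (typically 10-12 years) when the work lands;
    # None for immediate items.
    year_of_action: Optional[int] = None
    # Opinion of Probable Cost for the deficiency or reserve item, in dollars.
    opinion_of_probable_cost: float
    # One-line description tying the cost back to the observed deficiency.
    deficiency: str

The Literal is what makes the write schema-enforced: an out-of-vocabulary category fails validation instead of slipping quietly into the report.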
Where the docs lived mattered as much as what they said.
Location as governance
The final docs live at dev/next-gen/. Not dev/specs/, which is auto-refreshed from code and describes what exists. Not dev/decisions/, which is where real ADRs go. The directory name does governance work: this is forward-looking, not current-state, and not decided.
Three files:
dev/next-gen/
├── README.md ( 22 lines) entry + status + contents
├── design.md (128 lines) domain, workflow, architecture, principles
└── build.md (195 lines) layout, libraries, code sketch, patterns
345 lines total
Each file leads with the same line:
Status: Exploratory working notes — not a committed plan.
Not in the footer. Not as a note at the end. The first line a reader sees.
Zero occurrences of the prototype’s name across all three files:
$ grep -ic "meap" dev/next-gen/*.md
dev/next-gen/build.md:0
dev/next-gen/design.md:0
dev/next-gen/README.md:0
That wasn’t cosmetic. It forced every architectural claim to stand on its own. The principles the prototype taught (runtime separate from authoring, database as system of record, schema-enforced writes, multi-tenancy, observable AI runs) all survived the rewrite. They’re expressed as general architectural commitments, not as reactions to specific historical gotchas.
The human stays at the gates
One piece of the spec got very specific. It’s worth pulling out, because the default for most professional-services work with LLMs right now is somewhere much worse.
Ethan Mollick names three modes of working with AI. Centaurs partition tasks: AI does some stages, humans do others, clean handoffs. Cyborgs weave through the same work together, sentence by sentence. Self-automators offload: hand the AI a task, take the output, make minor edits, paste it into the deliverable, ship. Mollick’s research partner on the framing, Fabrizio Dell’Acqua, has called self-automation “abdicated co-creation.” Quick, polished results, no skill development, no real custody of the work.
Most professional-services teams using LLMs today are self-automators. They open a chatbot, paste in a fragment, get a response, copy what’s useful, paste it into the real document, go ask the chatbot the next question. Maybe they hand it a file and ask it to look. But the output is always a chat, something whose natural end is the chat window. It doesn’t flow anywhere. It has to be mentally integrated or copy-pasted. Every step is human plumbing. The AI is a very fast intern who can’t actually touch the artifacts the deliverable is made of.
The prototype is not self-automation. The state files are the connective tissue. The LLM reads building.json, works over it across multiple sessions, and writes back in structured form that the next stage can consume. Cost opinions, narrative cards, photo intelligence. All flow through the same file-based state, not through chat windows. I run the commands and approve the outputs, but I’m not manually moving data between stages. The files do that work.
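To make “the files do that work” concrete, here is a rough sketch of how one stage hands its output to the next through the shared state file. building.json is the prototype’s real state file, but the key and the function shown are hypothetical, a sketch of the pattern rather than the prototype’s actual code.

import json
from pathlib import Path

STATE = Path("building.json")  # the shared state file each stage reads and writes

def record_cost_opinion(equipment_id: str, opinion: dict) -> None:
    # Read the current state, attach this stage's output under a key the next
    # stage knows to look for, and write it back. The file is the handoff;
    # no human copies anything between chat windows.
    state = json.loads(STATE.read_text())
    state.setdefault("cost_opinions", {})[equipment_id] = opinion
    STATE.write_text(json.dumps(state, indent=2))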
I am the gate and the initiator. Not the connective tissue. That’s the crucial distinction.
In the taxonomy: the production pipeline is centaur. Discrete AI stages, human gates at every handoff. The authoring pair where Claude and I built the platform (and this post) is cyborg. Two modes at different layers, both deliberate. Neither is self-automation.
But it’s also primitive. Every stage runs inside a Claude Code session. State lives in files on disk. Approvals are implicit: I see the output in the terminal and decide whether to re-run or move on. There’s no formal review dashboard, no database, no audit log, no multi-user anything. It’s a power tool that works because the state files are well-designed, not because the infrastructure is.
The successor keeps the gate model and makes it non-primitive. The pipeline runs without a Claude Code session around it. State lives in Postgres with an audit log. Approvals are explicit: a reviewer opens a dashboard, sees what the pipeline produced at capture (most important), at data processing, at cost generation, at report generation, and approves or revises row by row. Same gate model. Proper receipts.
I wrote about the three misses last month. The three ways an AI pipeline can land the wrong answer, and why every stage has to constrain plausibility. Human gates are how you make that constraint real. Each gate IS the receipt: proof that a specific person signed off on that stage’s output and now takes responsibility for it. Without the gate, the chain breaks and the AI fills the gap with plausible fiction. With the gate, every number and every sentence in the final report has a name and a timestamp attached to it somewhere upstream.
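What a gate-as-receipt record might look like in the successor, sketched as a plain Python structure: the fields mirror the claim above (a name and a timestamp per stage), but the specific shape is an assumption, not something the spec pins down.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GateApproval:
    stage: str             # e.g. "capture", "cost_generation", "report_generation"
    artifact_id: str       # the row or document the reviewer actually looked at
    reviewer: str          # the named person who takes responsibility from here on
    approved_at: datetime  # when that responsibility transferred

def sign_off(stage: str, artifact_id: str, reviewer: str) -> GateApproval:
    # Recording the approval is the receipt; persisting it (the Postgres audit
    # log in the successor) is what makes the chain of custody inspectable later.
    return GateApproval(stage, artifact_id, reviewer, datetime.now(timezone.utc))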
Capture is the most important gate. If a garbage observation enters the pipeline, no downstream process removes it; the downstream stages polish it, which is worse. The assessor has to approve what gets captured before it moves on. Every downstream gate inherits that constraint.
That’s the quality argument. There’s a second argument, and it’s just as load-bearing.
This is professional services. A PCR is a document a client pays an engineering or architectural firm to stand behind: the firm whose license is on the line, the firm that can be sued for what ships in the pages, the firm whose reputation rides on the signature block. There is no way around a human taking responsibility for the output. There shouldn’t be. The gates are where that responsibility transfers from the pipeline to the person who signs.
This is how to use LLMs in a professional setting: not as an autonomous replacement for the human, but as a tool the human keeps approving as it runs.
What I still don’t know
The spec is done. It’s good. A second engineer could read it on a Monday.
I still don’t know whether I’ll build it.
Maybe the power tool is the answer. The platform stays what it is (primitive but working, files as the connective tissue, me at the gates between Claude Code sessions), and the version-next energy goes into making the one-person version sharper instead of making a many-person version exist. The twenty-four gotchas would still be there. Syncthing would still be Syncthing. The power tool would just be better.
Maybe the successor is worth building. Same gate model, non-primitive infrastructure: pipeline autonomous, Postgres with audit log, review dashboards instead of terminal approvals, multi-tenant, observable. Something that could survive a month off.
I don’t know which.
The spec exists so that the question becomes answerable instead of arguable. That’s the forward direction of spec-to-kill. You don’t spec to decide whether to kill the thing you have. You spec to decide whether to build the thing you don’t.
You never lose the work. If I don’t build it, the spec is the documented argument for why not. If I do, the blueprint is ready. If someone asks a year from now why a single-user power tool isn’t also a multi-tenant SaaS, I hand them the spec.
Same method. Different direction. See what you actually think.