Written with Claude. The entire process described here — the spec generation, the triage conversation, and this blog post — was done in one Claude Code session.
## The Wall of Findings
I have a home network infrastructure repo. Raspberry Pi running Pi-hole, Caddy, WireGuard, CrowdSec. About 20 scripts, a Go TUI dashboard, some systemd units, a security dashboard. The kind of thing that accumulates config and code over 280+ sessions of Claude Code work.
I hadn’t refreshed the specs in six weeks. So I ran `/spec refresh ALL`.
What came back: 73 findings. 16 marked HIGH severity. Five categories. And my first reaction was the same one anyone would have looking at a wall of red.
That’s a lot of problems for a home network.
## How the Spec Pipeline Works
The pipeline runs in four waves. Each wave’s output feeds the next. All agents within a wave run in parallel.
Wave 1 — Domain Tracers. Four agents, one per domain. Each gets a prompt like this:
```
Trace the 'scripts' domain comprehensively.
Entry points: scripts/

Produce a COMPLETE spec covering:
- Overview, Architecture, Data Model, API Surface,
  Configuration, Dependencies, Edge Cases
- Cite specific file paths for every claim

Write to: dev/specs/scripts.md
```
The agent reads every file in its domain, follows dependency chains, and produces a standalone specification. It catches things you wouldn’t find without comparing what’s in git to what’s actually running. The scripts tracer, for example, read wg-logger.sh line 20 and saw it writes directly to NFS via append redirect. Then it read the memory bank notes about Session 75, where that same script was patched on the live server to use syslog instead. The repo was never updated. That kind of drift — where production and git quietly disagree — is exactly what these agents are good at surfacing.
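To make the drift concrete, here is roughly the shape of that divergence as a sketch. The paths, variable names, and log format are illustrative, not quoted from the repo:

```bash
#!/usr/bin/env bash
# Illustrative sketch of the drift, not the actual wg-logger.sh.
peer="peer0"                 # placeholder values
stamp="$(date -Is)"

# Repo version (paraphrasing the finding): append straight onto an NFS-mounted log.
# If the NFS mount is unavailable, the write blocks and the script hangs.
echo "$stamp $peer" >> /mnt/nfs/logs/wireguard.log

# Live-server patch from Session 75 (per the memory bank notes): hand the line to syslog
# instead, so rsyslog decides where it lands and an NFS outage can't hang the script.
logger -t wg-logger "$stamp $peer"
```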
Wave 2 — Auditors. Two agents in parallel that check Wave 1’s work. One looks for contradictions between the four specs. The other fact-checks claims against the actual code. For example, one spec said “harden-ssh.sh uses $MAC_MINI from network.env.” The auditor grepped the script, got zero matches, and flagged it. The spec was wrong — that variable doesn’t appear anywhere in the file.
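The check itself is as blunt as it sounds. Something like this hypothetical invocation, not the auditor's literal command:

```bash
# Does the variable the spec claims to use actually appear in the script?
grep -n 'MAC_MINI' scripts/harden-ssh.sh \
  || echo "no matches: the spec's claim about \$MAC_MINI is unsupported"
```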
Wave 3 — Synthesizers. At this point we have four domain specs (every file, config, dependency, and edge case mapped out) and two audit reports (contradictions flagged, factual claims verified against actual code). Wave 3 reads all of that and produces two cross-cutting documents: a data dictionary (every config struct, log format, and env var in one place) and a set of system invariants — rules that must always hold, like “Caddy env vars must exist at parse time, not just at runtime.” Thirty-seven invariants total.
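To show what an invariant buys you, here is a hypothetical spot-check for the Caddy env var rule. It is not part of the pipeline, the Caddyfile path is assumed, and it only approximates "parse time" by checking the shell it runs in:

```bash
# Rough check for the invariant "Caddy env vars must exist at parse time":
# list every {env.*} placeholder in the Caddyfile and confirm it is set right now.
grep -o '{env\.[A-Za-z_][A-Za-z0-9_]*}' /etc/caddy/Caddyfile \
  | sed 's/{env\.\(.*\)}/\1/' | sort -u \
  | while read -r var; do
      [ -n "${!var}" ] || echo "not set: $var"
    done
```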
Wave 4 — Findings Analysts. Five agents in parallel. Here’s the full prompt for the security analyst:
```
This is a REFRESH, not initial generation. Produce a comprehensive,
standalone findings document. Do not reference or diff against the
previous version — generate from scratch using current codebase state.

Analyze this codebase for security issues.

Reference these specs for context:
- dev/specs/scripts.md
- dev/specs/tui-dashboard.md
- dev/specs/service-configs.md
- dev/specs/audit-commands.md
- dev/specs/cross-reference-audit.md
- dev/specs/validation-report.md
- dev/specs/data-dictionary.md
- dev/specs/invariants.md

Write to: dev/specs/findings/security.md
```
The actual instruction is six words: “Analyze this codebase for security issues.” That’s it. No checklist of vulnerability categories. No “make sure you check for XSS.” The agent is a specialized findings-analyst subagent — it already knows what to look for. Overconstraining it with a specific list would make it stop looking for things you didn’t think to list.
What does the work is context. By the time this agent runs, three previous waves have already mapped every file, found contradictions between specs, validated claims against actual code, and extracted 37 system invariants. The analyst doesn’t need detailed instructions. It needs detailed context. The waves provide that.
Five copies of this prompt, one each for security, operational risk, technical debt, dead code, and documentation drift. Same context, different lens.
Thirteen agents total. About 25 minutes wall time. The output: four domain specs, two audit reports, two synthesis documents, five findings reports. And those 73 findings.
## The Severity Label Problem
The security agent flagged NFS no_root_squash as HIGH. Technically correct. If someone compromises one Pi, they can plant malicious files on the shared network drive that the other Pi might execute.
On a home network. Two Pis, one admin, no untrusted users on the LAN.
The agent can’t know that context. It sees a vulnerability pattern and labels it HIGH. Because by the textbook, it is.
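For anyone who has not met `no_root_squash`, the finding comes down to one option on an exports line. The path and subnet below are made up:

```
# Illustrative /etc/exports entries, not the repo's actual config.
# no_root_squash: root on a client Pi stays root on the share, so a compromised Pi
# can plant root-owned files where the other Pi will pick them up.
/srv/shared  192.168.1.0/24(rw,sync,no_root_squash)

# The textbook-safer default maps client root to an unprivileged user instead:
/srv/shared  192.168.1.0/24(rw,sync,root_squash)
```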
Same story with seven admin services exposed through Caddy without proxy-layer auth. HIGH. Sounds terrible. But each app has its own authentication. Pi-hole has a password. Home Assistant has a login. The Synology NAS has 2FA. The “missing” auth layer is defense-in-depth that we’ve already decided to address with Authelia, whenever that gets prioritized.
The agent does what it’s supposed to do. The problem isn’t the agent. The problem is what happens after.
## Triage: Using the Wrong Tool the Right Way
I have a brainstorming skill in Claude Code — a reusable prompt designed for collaborative solution design. You bring it a problem, it asks clarifying questions one at a time, proposes approaches, and helps you converge on a design.
I brought it 73 findings.
It tried to help me solve them. Proposed grouping strategies. Hedged about scope. Offered options: fix the easy ones now, defer the hard ones, file issues for later.
I redirected it. We’re not solving anything. We’re classifying.
Every finding goes into exactly one bucket:
- Real issue — needs a GitHub issue filed
- Accepted architecture decision — we know about it, we’ve decided it’s fine, here’s why
- Not planned — technically imperfect, practically irrelevant
- Duplicate — same finding restated in a different category
- Already tracked — existing GitHub issue covers it
- Resolved by deprecation — the TUI is getting retired, this goes with it
No “we’ll look at it later.” No “probably fine.” Every finding gets a decision with reasoning attached.
The brainstorming skill’s one-at-a-time discipline turned out to be perfect for this. It presented each category as a table. I made a call on each row. When it classified something as “not planned” and I wasn’t sure, I’d push back — “are you sure about that?” — and it would either defend the reasoning or reclassify. When I said “add an ELI5 column,” it restructured every table on the spot.
Here’s what the actual triage conversation looked like:
| Finding | ELI5 | Decision |
|---|---|---|
| 7 services without proxy-layer auth | Pi-hole, NAS, HA on public internet with only their own login | Accepted — Authelia #86 is the fix |
| NFS no_root_squash | If one Pi gets hacked, it can plant traps on the shared drive | Accepted — LAN-only, documented risk |
| curl -sk skips TLS for UniFi API | Router API key sent without verifying who we’re talking to | Real issue — easy cert pin |
Finding by finding. No skipping.
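The cert-pin fix for the `curl -sk` row really is small. A hedged sketch, assuming the controller's self-signed cert has been exported to the Pi; the endpoint, header, and paths are placeholders:

```bash
# Before (the finding): -k disables TLS verification, so the API key goes to whoever answers.
curl -sk -H "X-API-Key: $UNIFI_API_KEY" "https://192.168.1.1/api/health"

# After: drop -k and pin the controller's certificate, so the request fails
# instead of handing the key to an impostor.
curl -s --cacert /etc/ssl/certs/unifi-controller.pem \
  -H "X-API-Key: $UNIFI_API_KEY" "https://192.168.1.1/api/health"
```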
## The ELI5 Column
I asked for the ELI5 column halfway through the security section. Once every finding had to be explained in plain English, the decisions got easier.
“SEC-M03: Secrets file has no enforced permission check” becomes “the file with API keys doesn’t verify it’s locked down (chmod 600).” Single-user machines. Not planned.
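For context, an "enforced permission check" would be a guard like this at the top of a script. Hypothetical, with a placeholder path:

```bash
# Hypothetical guard, not in the repo: refuse to run if the secrets file is readable by others.
SECRETS_FILE="/home/pi/.config/infra/secrets.env"   # placeholder path
perms="$(stat -c '%a' "$SECRETS_FILE")"
if [ "$perms" != "600" ]; then
  echo "Refusing to run: $SECRETS_FILE has mode $perms, expected 600" >&2
  exit 1
fi
```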
“TD-012: dietpi-backup-wrapper free space check uses integer division” becomes “the backup script rounds 29.9 GB to 29 instead of 30.” The bug makes the threshold more conservative, not less. NAS has 800 GB free. Not planned.
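The arithmetic is the whole argument here. Shell division truncates toward zero, so the measured value can only undercount, which makes the guard stricter, never looser. An illustrative reconstruction, not the actual wrapper:

```bash
# Illustrative reconstruction of the free-space check, not the actual dietpi-backup-wrapper.
free_kb="$(df --output=avail /mnt/backup | tail -1)"   # available space in 1K blocks
free_gb=$(( free_kb / 1024 / 1024 ))                   # integer division: 29.9 GB becomes 29
if [ "$free_gb" -lt 30 ]; then
  echo "Not enough free space for backup" >&2          # can only fire early, never late
  exit 1
fi
```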
“OR-H-002: wg-logger.sh repo version hangs on NFS unavailability” becomes “the version in git has a bug we already fixed on the server. Deploying from git re-breaks it.” Real issue.
If you can’t explain why a finding matters in one sentence, you probably don’t know whether it matters. Several HIGH-severity findings deflated the moment they had to be stated simply.
I also pushed Claude to defend its “not planned” classifications:
“are you sure about these?”
It re-examined each one. The backup integer truncation: “the bug is in the safe direction — the effective threshold is 31 GB instead of 30 GB, and the NAS has 800 GB free.” The one-off SSH hardening script: “both machines are already hardened, this script has been run exactly once and its job is finished.” The GoAccess double-logging: “10 extra journal lines per hour, fixing it risks breaking rsyslog routing to NFS.”
All three held up. But the pressure test mattered. If you can’t defend “not planned” when challenged, it’s not “not planned” — it’s avoidance.
## One Decision, Twenty Findings
The TUI was our second-ever Go TUI. A learning project. Bubbletea framework, SSH health checks, a dashboard that polled services every 30 seconds. It worked. We learned a lot building it.
It also hadn’t been touched in months. And the findings reflected that. Twenty of them, spread across every category: unused functions nobody calls, a chat feature that was built but never connected, version numbers frozen since 2025, documentation describing code that had already changed. None of them were urgent on their own. All of them pointed at the same thing: this code isn’t maintained, it’s just sitting there accumulating concerns.
As we marched through findings, the “resolved by TUI deprecation” bucket kept growing. Security. Dead code. Tech debt. Documentation drift. By the time we’d gone through three categories, the decision was obvious. Not because any single finding demanded it. Because 20 findings across every category told the same story.
One architectural decision. Twenty findings resolved.
## The Root Cause Pattern
Four of the highest-severity findings were all the same problem.
The Caddyfile in git is missing the Audiobookshelf site. The wg-logger script in git has a known NFS hang that was fixed on the live server. The GoAccess service unit in git is missing a circuit breaker that was added on the server. Three CrowdSec config changes from last session only exist on the Pi.
Four findings. One root cause: we fix things on the server and forget to save them back to git.
Seeing that pattern led to a new issue that none of the agents suggested: write a fetch-configs.sh / deploy-configs.sh script pair to make the manual sync less error-prone. No individual finding called for it. The pattern across findings did. That’s something a severity label on a single finding can never show you.
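One possible shape for that pair, hedged heavily: the host, paths, and file list are all placeholders, and rsync over SSH is just the obvious first tool to reach for:

```bash
#!/usr/bin/env bash
# fetch-configs.sh, sketched: pull the live configs off the Pi into the repo so git
# reflects what production actually runs. deploy-configs.sh would be the same list in reverse.
set -euo pipefail

HOST="pi@pihole.lan"                          # placeholder host
REPO_DIR="$(git rev-parse --show-toplevel)"

# Placeholder mapping: live path on the Pi -> path in the repo
declare -A FILES=(
  ["/etc/caddy/Caddyfile"]="configs/Caddyfile"
  ["/usr/local/bin/wg-logger.sh"]="scripts/wg-logger.sh"
  ["/etc/systemd/system/goaccess.service"]="systemd/goaccess.service"
)

for remote in "${!FILES[@]}"; do
  local_path="$REPO_DIR/${FILES[$remote]}"
  mkdir -p "$(dirname "$local_path")"
  rsync -a "$HOST:$remote" "$local_path"
done

git -C "$REPO_DIR" status --short             # show what drifted; the commit stays a human call
```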
## What Came Out the Other Side
73 findings. 15 real issues.
| Disposition | Count |
|---|---|
| Real issues | 15 |
| TUI deprecation | ~20 |
| Duplicates | ~18 |
| Accepted decisions | 8 |
| Not planned | 8 |
| Already tracked | 2 |
None of this required writing code. The spec refresh ran for ~25 minutes while I did other things. The triage was ~20 minutes of conversation. The output was 15 GitHub issues, each with context and acceptance criteria, plus a documented reason for every finding we chose not to fix.
That’s the part that matters most to me. Not the 15 issues — those are just tickets. It’s the 58 findings we didn’t file. Each one has a reason attached. “Accepted — LAN-only, documented risk.” “Not planned — bug is in the safe direction.” “Resolved by TUI deprecation.” Six months from now, when one of those findings comes up again, the reasoning is already written down.
“Correct finding” and “worth fixing” are different questions. The agents handle the first one. You handle the second.
Written with Claude.