Written with Claude.
I was about to build the wrong filter
I have a lot of GitHub issues on my field-service app. A hundred and sixty-one open, mostly written by me, about my own tool. Most of them I still want to fix. Some of them became stale before I got to them.
I’d been working on a planner script that groups related issues into epic buckets, so I can attack a theme instead of chewing through the log one bug at a time. The planner was about to get a new pre-pass: drop any issue the nightly crawler bot had already marked as “already fixed,” then cluster what’s left. This felt obviously correct. Why waste tokens re-investigating issues a bot has already told me are done?
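For concreteness, the pre-pass would have been something like this. A minimal sketch: the Octokit calls are standard, but the function name and the assumption that its output feeds straight into the clusterer are just my setup.

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// The pre-pass I almost shipped: fetch open issues, silently drop
// anything the crawler has labeled verified-fixed, return the rest
// for clustering. No human ever sees what gets dropped.
async function plannerInput(owner: string, repo: string) {
  const issues = await octokit.paginate(octokit.rest.issues.listForRepo, {
    owner,
    repo,
    state: "open",
    per_page: 100,
  });

  // The dangerous part: an advisory label consumed as ground truth.
  return issues.filter(
    (issue) =>
      !issue.labels.some(
        (label) =>
          (typeof label === "string" ? label : label.name) ===
          "crawler:verified-fixed"
      )
  );
}
```

The filter itself is a few lines. That's part of the trap: it is trivially easy to consume a label as if it were true.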
Before I built it, I asked another Claude session in the repo that owns the crawler what it thought. Did this pre-pass duplicate something? Was there an existing skill I should use instead? Standard expert-consultation stuff.
The consultation came back with a finding I didn’t want but couldn’t ignore. The crawler had marked 28 issues in my repo as crawler:verified-fixed. My expert picked four of them at random and checked each against the actual code. Four for four, still live bugs.
Reading the list of false positives
Once I knew to look, the pattern was obvious. The crawler had marked these as fixed, among others:
- “Component chips should be collapsed at initial view, too many to pick from.”
- “Tile progress: only not-present items go green, very hard to get greens.”
- “Sync button obscured by building name on mobile.”
- “Notes text field too small for extended field notes on mobile.”
These are all visual claims. They’re about how the app looks when you use it. “Too many to pick from” is a perceptual judgment. “Obscured by building name on mobile” requires rendering the thing at a 411-pixel-wide viewport and looking. “Text field too small” is a finger-on-screen thing.
The crawler’s investigator agent has four tools: Read, Grep, Glob, Bash. It can read files. It can grep for strings. It can list directories. It can run shell commands. It does not have a browser. It cannot render HTML. It cannot measure layout. It does not know what the app looks like on a phone, or on any screen, ever.
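For contrast, here's roughly what checking the sync-button claim for real would take. This is a hypothetical Playwright sketch, with made-up selectors and URL, not anything the crawler has; the point is the category of tooling the investigator is missing.

```ts
import { chromium } from "playwright";

// Verifying "sync button obscured by building name on mobile" means
// rendering at phone width and measuring. Selectors and URL are
// hypothetical stand-ins.
async function syncButtonVisible(url: string): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage({
    viewport: { width: 411, height: 823 }, // the viewport the bug report implies
  });
  await page.goto(url);

  const button = await page
    .locator('[data-testid="sync-button"]')
    .boundingBox();
  const name = await page
    .locator('[data-testid="building-name"]')
    .boundingBox();
  await browser.close();

  if (!button || !name) return false;
  // "Obscured" here means the two bounding boxes overlap.
  const overlap =
    button.x < name.x + name.width &&
    name.x < button.x + button.width &&
    button.y < name.y + name.height &&
    name.y < button.y + button.height;
  return !overlap;
}
```

None of the four tools above can do any of this.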
So when it got asked “is this sync-button-obscured-by-building-name bug fixed?”, what did it do? It found code. The code looked plausible. It labeled the issue verified-fixed.
The agent confidently labeled things it couldn’t see
The labels are wrong. That’s the first thing. Twenty-eight of them on my repo right now, labeling live bugs as fixed. The wrongness has a specific shape that’s worth naming, but naming the shape doesn’t make the labels less wrong.
The agent wasn’t making things up from nothing. It wasn’t ignoring its prompt. Given a code-grounded bug (“the migration script silently drops rows when the source table has a nullable primary key”), the agent can do real work. It can find the migration. Read it. Check whether the nullable case is handled. Confirm a fix commit. Produce a file:line citation.
Given a UI claim (“the chips are too many to pick from”), the agent has no evidence to gather. “Unable to verify” is not a response shape the agent’s prompt encoded. So it pattern-matched to the nearest thing it could say, which is “I looked at the code and it looks fine.” That came out as verified-fixed.
The agent’s output confidence wasn’t bounded by its evidence-gathering capability. When the question class didn’t match the tool class, the agent still answered, in the vocabulary it had, at whatever confidence its prompt allowed. Nothing in the output format signaled that it was a different class of answer from a real verdict, so downstream consumers treated it like one.
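You can state the missing piece as a schema. This is my framing, not the crawler's actual types, but it's the shape of the problem:

```ts
// The vocabulary the agent effectively had: every issue gets one of
// two verdicts, and both read as evidence-backed downstream.
type VerdictAsShipped = "active" | "verified-fixed";

// A vocabulary where confidence is bounded by evidence. The new arm
// makes "I couldn't gather evidence" a first-class answer instead of
// something that pattern-matches into "verified-fixed".
type Verdict =
  | { kind: "active"; citation: string } // file:line evidence required
  | { kind: "verified-fixed"; citation: string } // file:line evidence required
  | {
      kind: "unable-to-verify";
      reason: "needs-rendering" | "needs-device" | "perceptual-judgment";
    };
```

With the third arm in place, a downstream consumer can at least refuse to act on any verdict that doesn't carry a citation.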
The other failure mode I didn’t expect
While the expert was digging through the crawler’s history, they noticed something else. Issue #805 in my repo is a navigation bug: when you’re drilled into the capture screen on a specific building, hitting the back button should go up one level, not jump all the way out to the project picker. On April 14th, the crawler ran and correctly flagged it as Active, with a file:line citation pointing at the back-button handler in the Svelte layout file.
Three days later, April 17th, the crawler ran again. Nothing had changed in the code. Same back-button behavior. Same bug. The crawler flipped its verdict to verified-fixed.
Running the bot didn’t just produce new wrong answers. It overwrote correct existing ones. The pipeline is non-idempotent in the worst direction: re-running the auditor can subtract from the knowledge base, not add to it.
I don’t know exactly why this happens. The expert and I noticed the label got flipped while the accompanying comment body still said Active (the comment dedupe logic skipped an update, but the label-application logic fired). That’s one mechanism. There may be others. The point isn’t the specific bug. The point is that I was about to treat the output of this pipeline as authoritative, and the pipeline actively corrodes its own correct history.
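Here's my reconstruction of that mechanism's shape, with hypothetical stand-ins for the GitHub plumbing. The real crawler's code differs; the two-paths-one-verdict structure is what we observed.

```ts
// Hypothetical stand-ins for the crawler's GitHub plumbing.
declare function getLastCrawlerComment(
  issue: number
): Promise<{ body: string } | null>;
declare function bodiesEquivalent(a: string, b: string): boolean;
declare function postComment(issue: number, body: string): Promise<void>;
declare function applyLabel(issue: number, label: string): Promise<void>;

async function recordVerdict(issue: number, verdict: string, body: string) {
  const existing = await getLastCrawlerComment(issue);

  // Path 1: the comment is deduped. If the new body looks like the old
  // one, the update is skipped -- which is why the visible comment can
  // keep saying "Active".
  if (!existing || !bodiesEquivalent(existing.body, body)) {
    await postComment(issue, body);
  }

  // Path 2: the label fires unconditionally on the latest (possibly
  // flaky) verdict -- which is how a bad run overwrites a correct
  // "active" label with "verified-fixed" while the comment disagrees.
  await applyLabel(issue, `crawler:${verdict}`);
}
```

Two paths that should have been one transaction.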
The distinction I needed
I keep coming back to this, because I think it’s where general-purpose LLM tooling keeps breaking in the same way.
There are two trust levels for any classifier output, and people conflate them constantly.
A signal can be advisory. It says “a human should look at this.” The failure mode of a wrong advisory signal is some wasted human time. A human picks it up, sees that it’s wrong, and moves on.
A signal can be a trust boundary. It says “this is true, downstream automation can act on it.” The failure mode of a wrong trust-boundary signal is silent destructive action: whatever the downstream automation does, it does to something that should never have been included.
The crawler’s output is advisory by design. The whole pipeline has human-in-the-loop closers further down: an interactive triage skill, an auto-loop version with its own Review Agent, a directed batch-close that still requires the user to supply issue numbers. None of them trust the crawler’s label alone. They all re-verify before closing. So when the labels are wrong, the damage is bounded by a human reading them. Bounded, not zero. Every wrong label taxes the person doing the review with work they shouldn’t have had to do.
I was going to take that advisory output and use it as a trust boundary for a filter step. Drop these issues before clustering. No human in the loop, because filters don’t prompt, they just filter. Twenty-eight live bugs, silently removed from the planner’s working set on every run, for however long it took me to notice.
The consultation did two things at once. It caught me about to commit a category error, consuming an advisory signal as if it were a trust boundary. And it surfaced the bug underneath, which was that the 28 labels I was about to filter on were wrong in the first place. I wouldn’t have known either problem existed without asking.
What I did instead
I kept the Wave-2 grouping work. That piece doesn’t depend on the crawler, and the problem it solves (is this cluster of issues actually about one thing, or a false-positive grouping?) is real regardless.
I parked the pre-pass filter. There’s now a tracking issue on the crawler itself about the verification-impossible classes and the self-reversal bug. Until those are fixed, the right move is to not consume its output as a trust boundary anywhere, and to treat its labels with suspicion even during manual review.
I rewrote the gotcha in my project’s CLAUDE.md. The old one said the labels were “unreliable, always verify before closing.” That’s true, but it’s soft. It doesn’t name the mechanism. The new one names both failure modes specifically, with the sample-check numbers and the self-reversal example, so the next time I or another Claude session is about to consume those labels, the warning is specific enough to be load-bearing.
The thing I’ll watch for next time
Whenever I’m about to wire an AI output into an automated pipeline, I now ask one question: what would it take to be wrong here, and what would that wrongness cost?
If the answer is “a human notices and moves on,” the signal is advisory and can be used that way. Advisory signals still cost human time when they’re wrong, but the damage is bounded by someone looking. If the answer is “something happens silently to the wrong thing,” I need a trust boundary, and I need to verify the output actually earns one.
There are two bugs in this story. The crawler’s labels are wrong on 28 issues right now. That’s a real problem in the bot, and it still needs fixing, because even as advisory hints those labels mislead the next human who reads them. My pre-pass filter would have taken those wrong labels and turned them into 28 silently-dropped live bugs on every run. That’s a different problem, on my side. Fixing one doesn’t fix the other. Both of them almost shipped.