When Your AI Assistant Crashes Your Infrastructure
The Task
Simple enough: add UniFi firewall log parsing to CrowdSec so router events would feed into threat detection. Claude had done dozens of similar configurations before. SSH in, install a collection, add an acquisition config, reload the service.
What could go wrong?
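For context, the acquisition half of that task is a small YAML file. A minimal sketch, assuming CrowdSec's syslog datasource — the path, listen address, and port here are hypothetical, not the values from my setup:

```yaml
# /etc/crowdsec/acquis.d/unifi.yaml (hypothetical path)
# Receive UniFi router syslog over the network and label it for the parsers.
source: syslog
listen_addr: 0.0.0.0
listen_port: 514
labels:
  type: syslog
```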
What Went Wrong
Claude ran multiple SSH commands in parallel to speed things up. Some of those commands (cscli operations) are slow - they hit the CrowdSec API and can take 10-30 seconds. Those SSH sessions stayed open, waiting.
Claude tried more commands. More SSH sessions opened. The Infrastructure Pi runs Dropbear, a lightweight SSH daemon with a limit of about 10 concurrent connections.
Claude hit that limit. New SSH connections failed with “banner exchange” timeouts. Claude, not understanding why SSH was suddenly broken, tried even more commands. Each attempt held another connection slot.
Within a few minutes, the Infrastructure Pi was completely unresponsive. SSH dead. DNS dead. The entire home network went down because Pi-hole couldn’t answer queries.
I rebooted the Pi from physical access.
Making It Worse
After the reboot, CrowdSec failed to start. Config file issues from the interrupted installation. Claude started troubleshooting - which meant more SSH commands, more potential for hanging.
Then Claude made two critical mistakes:
1. Let CrowdSec restart in a loop. Each failed start attempt hit the CrowdSec CAPI (Central API). Multiple rapid failures from the same IP triggered rate limiting. My IP got blocked from api.crowdsec.net entirely.
2. Deleted credential files during troubleshooting. Trying to "clean up" the config, Claude removed online_api_credentials.yaml. That file contains the registration token for the CrowdSec API. Gone.
The result: CrowdSec completely broken, rate-limited from the central API, and missing the credentials needed to re-register even after the rate limit cleared.
The Recovery
Recovery took two sessions:
Session 63 (damage):
- Infrastructure Pi rebooted manually
- CrowdSec stopped and disabled
- UniFi bouncer container stopped
- Credential files backed up (what remained)
- Waited for rate limiting to clear (~1 hour minimum)
Session 64 (recovery):
- Verified rate limit cleared: curl -s https://api.crowdsec.net/ returned 200
- Re-registered with CAPI: sudo cscli capi register
- Re-enabled CrowdSec: sudo systemctl enable --now crowdsec
- Generated new bouncer API key: sudo cscli bouncers add unifi-bouncer
- Recreated bouncer container with the new key
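The recovery steps above can be wrapped into a single guarded function so nothing runs until the rate limit has actually cleared. This is a sketch: the HTTP 200 check and the unifi-bouncer name mirror the steps above; everything else is an assumption about my setup.

```shell
#!/usr/bin/env bash
# Sketch of the Session 64 recovery, wrapped in a function so nothing
# executes until invoked deliberately.
recover_crowdsec() {
  # Abort unless the CAPI answers normally again (rate limit cleared).
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' https://api.crowdsec.net/)
  if [ "$status" != "200" ]; then
    echo "CAPI still rate-limited or unreachable ($status); wait longer" >&2
    return 1
  fi

  sudo cscli capi register                 # re-register with the Central API
  sudo systemctl enable --now crowdsec     # bring the service back
  sudo cscli bouncers add unifi-bouncer    # prints a fresh bouncer API key
}
```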
Total time from incident to full recovery: about 4 hours. During that time, no CrowdSec protection, no community blocklist updates, and a lot of manual verification.
The Changes
This incident resulted in a new section in my project’s CLAUDE.md: MANDATORY SAFETY RULES. These are explicit guardrails that Claude must follow when working with infrastructure.
SSH to Infrastructure Pi
- Parallel SSH is OK for quick commands (status checks, ls, grep)
- Slow commands run ONE AT A TIME (cscli, docker, apt, anything that might hang)
- ALWAYS use timeouts: ssh -o ConnectTimeout=10
- If a command seems stuck (>30s): Kill it before trying another
- If SSH becomes unresponsive: STOP. Don't retry. Tell user to check/reboot Pi.
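The SSH rules above can be sketched as a wrapper. PI_HOST is a placeholder, and the 10-second connect / 30-second overall limits mirror the rules; `timeout` (GNU coreutils) kills the whole command so a slow cscli call can never hold a Dropbear connection slot indefinitely.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for remote commands against the Infrastructure Pi.
PI_HOST="${PI_HOST:-pi@infra.local}"   # placeholder, not the real address

run_remote() {
  # ConnectTimeout bounds the banner/connection phase; `timeout` bounds the
  # entire command, covering slow cscli/docker/apt operations.
  timeout 30 ssh -o ConnectTimeout=10 -o BatchMode=yes "$PI_HOST" "$@"
}

# Local stand-in for the failure mode this guards against:
# GNU timeout exits with status 124 when the deadline fires.
timeout 1 sleep 3 || echo "stuck command killed, exit status $?"
```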
CrowdSec Changes
- NEVER reload/restart CrowdSec without testing config first: crowdsec -t
- BACKUP credentials before ANY change
- If CrowdSec fails to start: STOP IT IMMEDIATELY
- NEVER delete credential files - copy them, don't delete
- If you see restart loops: Disable the service
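The backup-then-validate routine above can be sketched as two helpers. The credential paths are CrowdSec's standard locations on Debian-family installs and may differ elsewhere; treat this as a template, not the exact script from my setup.

```shell
#!/usr/bin/env bash
# Sketch: always back up credentials, and never reload an untested config.

backup_creds() {
  # Copy (never move or delete) the credential files to a timestamped dir.
  local stamp dir
  stamp=$(date +%Y%m%d-%H%M%S)
  dir="/tmp/crowdsec-backup-$stamp"
  mkdir -p "$dir"
  cp -a /etc/crowdsec/online_api_credentials.yaml \
        /etc/crowdsec/local_api_credentials.yaml \
        "$dir/" 2>/dev/null || true   # tolerate missing files
  echo "$dir"
}

safe_reload() {
  backup_creds >/dev/null
  # Validate before touching the running service; bail out on a bad config.
  sudo crowdsec -t || { echo "config invalid, not reloading" >&2; return 1; }
  sudo systemctl reload crowdsec
}
```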
General Infrastructure Safety
- ASK BEFORE RISKY CHANGES
- One change at a time
- Verify before proceeding
- Have rollback ready
- When things go wrong, STOP
The Meta-Lesson
This wasn’t a bug in Claude. Claude did exactly what it was designed to do: execute commands efficiently, try to fix problems when they arise, work autonomously until the task is complete.
The problem is that those behaviors are dangerous on infrastructure that can’t tolerate aggressive troubleshooting. An AI assistant that keeps trying when things break is helpful for most development work. It’s catastrophic when “keep trying” means saturating connection limits or triggering rate limiting.
The solution isn’t to stop using AI for infrastructure. It’s to define explicit boundaries. The safety rules aren’t suggestions - they’re hard constraints that override normal behavior.
What Changed Permanently
Documented connection limits. Dropbear’s ~10 connection limit is now in the project’s gotchas. Future sessions know to be careful with parallel operations.
Explicit timeouts everywhere. Every SSH command in my infrastructure scripts uses ConnectTimeout=10. No more indefinite waits.
Credential backup before changes. Any CrowdSec config change now starts with copying credential files to /tmp/. Non-negotiable.
Stop on failure. When something breaks unexpectedly, Claude is explicitly instructed to stop and report rather than continue troubleshooting. It's my job to decide next steps.
Rate limiting awareness. If a service fails repeatedly, the first priority is preventing restart loops, not fixing the config.
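That restart-loop check can be sketched against systemd's NRestarts counter. The threshold of 3 is an arbitrary choice of mine, not a CrowdSec recommendation:

```shell
#!/usr/bin/env bash
# Sketch: if the service has crash-looped, stop it before the repeated
# failed starts trigger CAPI rate limiting.
restarts=$(systemctl show crowdsec -p NRestarts --value 2>/dev/null || echo 0)
if [ "${restarts:-0}" -gt 3 ]; then
  # Kill the loop first; fixing the config comes second.
  sudo systemctl disable --now crowdsec
fi
```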
The Uncomfortable Truth
An AI assistant operating on critical infrastructure is a force multiplier in both directions. When it works, tasks that would take an hour happen in minutes. When it fails, it can cause more damage faster than a human would because it doesn’t hesitate, doesn’t second-guess, doesn’t feel the network going down.
The guardrails I added aren’t about making Claude less capable. They’re about matching the assistant’s behavior to the tolerance for failure. A development environment can handle aggressive retry loops. A production DNS server cannot.
Every AI-assisted infrastructure setup should have its own version of these rules. What can fail? What’s the blast radius? How do you prevent the assistant from making things worse? Answer those questions before you need to.
This post documents a real infrastructure incident from January 2026. The safety rules described are now part of the project’s permanent configuration.
Written with Claude.