When Your AI Assistant Crashes Your Infrastructure
The Task
Simple enough: add UniFi firewall log parsing to CrowdSec so router events would feed into threat detection. Claude had done dozens of similar configurations before. SSH in, install a collection, add an acquisition config, reload the service.
What could go wrong?
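For context, the acquisition half of that task is a small YAML file. A minimal sketch, assuming CrowdSec's syslog datasource — the path, listen address, and port here are hypothetical, not the values from my setup:

```yaml
# /etc/crowdsec/acquis.d/unifi.yaml (hypothetical path)
# Receive UniFi router syslog over the network and label it for the parsers.
source: syslog
listen_addr: 0.0.0.0
listen_port: 514
labels:
  type: syslog
```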
What Went Wrong
Claude ran multiple SSH commands in parallel to speed things up. Some of those commands (cscli operations) are slow - they hit the CrowdSec API and can take 10-30 seconds. Those SSH sessions stayed open, waiting.
Claude tried more commands. More SSH sessions opened. The Infrastructure Pi runs Dropbear, a lightweight SSH daemon with a limit of about 10 concurrent connections.
Claude hit that limit. New SSH connections failed with “banner exchange” timeouts. Claude, not understanding why SSH was suddenly broken, tried even more commands. Each attempt held another connection slot.
Within a few minutes, the Infrastructure Pi was completely unresponsive. SSH dead. DNS dead. The entire home network went down because Pi-hole couldn’t answer queries.
I rebooted the Pi from physical access.
Making It Worse
After the reboot, CrowdSec failed to start. Config file issues from the interrupted installation. Claude started troubleshooting - which meant more SSH commands, more potential for hanging.
Then Claude made two critical mistakes:
1. Let CrowdSec restart in a loop. Each failed start attempt hit the CrowdSec CAPI (Central API). Multiple rapid failures from the same IP triggered rate limiting. My IP got blocked from api.crowdsec.net entirely.
2. Deleted credential files during troubleshooting. Trying to "clean up" the config, Claude removed online_api_credentials.yaml. That file contains the registration token for the CrowdSec API. Gone.
The result: CrowdSec completely broken, rate-limited from the central API, and missing the credentials needed to re-register even after the rate limit cleared.
The Recovery
Recovery took two sessions:
Session 63 (damage):
- Infrastructure Pi rebooted manually
- CrowdSec stopped and disabled
- UniFi bouncer container stopped
- Credential files backed up (what remained)
- Waited for rate limiting to clear (~1 hour minimum)
Session 64 (recovery):
- Verified rate limit cleared: curl -s https://api.crowdsec.net/ returned 200
- Re-registered with CAPI: sudo cscli capi register
- Re-enabled CrowdSec: sudo systemctl enable --now crowdsec
- Generated new bouncer API key: sudo cscli bouncers add unifi-bouncer
- Recreated bouncer container with the new key
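The recovery steps above can be wrapped into a single guarded function so nothing runs until the rate limit has actually cleared. This is a sketch: the HTTP 200 check and the unifi-bouncer name mirror the steps above; everything else is an assumption about my setup.

```shell
#!/usr/bin/env bash
# Sketch of the Session 64 recovery, wrapped in a function so nothing
# executes until invoked deliberately.
recover_crowdsec() {
  # Abort unless the CAPI answers normally again (rate limit cleared).
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' https://api.crowdsec.net/)
  if [ "$status" != "200" ]; then
    echo "CAPI still rate-limited or unreachable ($status); wait longer" >&2
    return 1
  fi

  sudo cscli capi register                 # re-register with the Central API
  sudo systemctl enable --now crowdsec     # bring the service back
  sudo cscli bouncers add unifi-bouncer    # prints a fresh bouncer API key
}
```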
Total time from incident to full recovery: about 4 hours. During that time, no CrowdSec protection, no community blocklist updates, and a lot of manual verification.
The Changes
This incident resulted in a new section in my project’s CLAUDE.md: MANDATORY SAFETY RULES. These are explicit guardrails that Claude must follow when working with infrastructure.
SSH to Infrastructure Pi
- Parallel SSH is OK for quick commands (status checks, ls, grep)
- Slow commands run ONE AT A TIME (cscli, docker, apt, anything that might hang)
- ALWAYS use timeouts: ssh -o ConnectTimeout=10
- If a command seems stuck (>30s): Kill it before trying another
- If SSH becomes unresponsive: STOP. Don't retry. Tell user to check/reboot Pi.
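The SSH rules above can be sketched as a wrapper. PI_HOST is a placeholder, and the 10-second connect / 30-second overall limits mirror the rules; `timeout` (GNU coreutils) kills the whole command so a slow cscli call can never hold a Dropbear connection slot indefinitely.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for remote commands against the Infrastructure Pi.
PI_HOST="${PI_HOST:-pi@infra.local}"   # placeholder, not the real address

run_remote() {
  # ConnectTimeout bounds the banner/connection phase; `timeout` bounds the
  # entire command, covering slow cscli/docker/apt operations.
  timeout 30 ssh -o ConnectTimeout=10 -o BatchMode=yes "$PI_HOST" "$@"
}

# Local stand-in for the failure mode this guards against:
# GNU timeout exits with status 124 when the deadline fires.
timeout 1 sleep 3 || echo "stuck command killed, exit status $?"
```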
CrowdSec Changes
- NEVER reload/restart CrowdSec without testing config first: crowdsec -t
- BACKUP credentials before ANY change
- If CrowdSec fails to start: STOP IT IMMEDIATELY
- NEVER delete credential files - copy them, don't delete
- If you see restart loops: Disable the service
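The backup-then-validate routine above can be sketched as two helpers. The credential paths are CrowdSec's standard locations on Debian-family installs and may differ elsewhere; treat this as a template, not the exact script from my setup.

```shell
#!/usr/bin/env bash
# Sketch: always back up credentials, and never reload an untested config.

backup_creds() {
  # Copy (never move or delete) the credential files to a timestamped dir.
  local stamp dir
  stamp=$(date +%Y%m%d-%H%M%S)
  dir="/tmp/crowdsec-backup-$stamp"
  mkdir -p "$dir"
  cp -a /etc/crowdsec/online_api_credentials.yaml \
        /etc/crowdsec/local_api_credentials.yaml \
        "$dir/" 2>/dev/null || true   # tolerate missing files
  echo "$dir"
}

safe_reload() {
  backup_creds >/dev/null
  # Validate before touching the running service; bail out on a bad config.
  sudo crowdsec -t || { echo "config invalid, not reloading" >&2; return 1; }
  sudo systemctl reload crowdsec
}
```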
General Infrastructure Safety
- ASK BEFORE RISKY CHANGES
- One change at a time
- Verify before proceeding
- Have rollback ready
- When things go wrong, STOP
The Meta-Lesson
This wasn’t a bug in Claude. Claude did exactly what it was designed to do: execute commands efficiently, try to fix problems when they arise, work autonomously until the task is complete.
The problem is that those behaviors are dangerous on infrastructure that can’t tolerate aggressive troubleshooting. An AI assistant that keeps trying when things break is helpful for most development work. It’s catastrophic when “keep trying” means saturating connection limits or triggering rate limiting.
The solution isn’t to stop using AI for infrastructure. It’s to define explicit boundaries. The safety rules aren’t suggestions - they’re hard constraints that override normal behavior.
What Changed Permanently
Documented connection limits. Dropbear’s ~10 connection limit is now in the project’s gotchas. Future sessions know to be careful with parallel operations.
Explicit timeouts everywhere. Every SSH command in my infrastructure scripts uses ConnectTimeout=10. No more indefinite waits.
Credential backup before changes. Any CrowdSec config change now starts with copying credential files to /tmp/. Non-negotiable.
Stop on failure. When something breaks unexpectedly, Claude is explicitly instructed to stop and report rather than continue troubleshooting. It's my job to decide next steps.
Rate limiting awareness. If a service fails repeatedly, the first priority is preventing restart loops, not fixing the config.
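That restart-loop check can be sketched against systemd's NRestarts counter. The threshold of 3 is an arbitrary choice of mine, not a CrowdSec recommendation:

```shell
#!/usr/bin/env bash
# Sketch: if the service has crash-looped, stop it before the repeated
# failed starts trigger CAPI rate limiting.
restarts=$(systemctl show crowdsec -p NRestarts --value 2>/dev/null || echo 0)
if [ "${restarts:-0}" -gt 3 ]; then
  # Kill the loop first; fixing the config comes second.
  sudo systemctl disable --now crowdsec
fi
```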
The Uncomfortable Truth
An AI assistant operating on critical infrastructure is a force multiplier in both directions. When it works, tasks that would take an hour happen in minutes. When it fails, it can cause more damage faster than a human would because it doesn’t hesitate, doesn’t second-guess, doesn’t feel the network going down.
The guardrails I added aren’t about making Claude less capable. They’re about matching the assistant’s behavior to the tolerance for failure. A development environment can handle aggressive retry loops. A production DNS server cannot.
Every AI-assisted infrastructure setup should have its own version of these rules. What can fail? What’s the blast radius? How do you prevent the assistant from making things worse? Answer those questions before you need to.
This post documents a real infrastructure incident from January 2026. The safety rules described are now part of the project’s permanent configuration.
Written with Claude.