How Hard NFS Mounts Silently Killed My DNS
The Complaint
A website wouldn’t load. That’s how I found out my entire home network’s DNS was down.
Quick context: I run a Raspberry Pi 4B as the sole DHCP and DNS server for 17 devices via Pi-hole. When this Pi has problems, nobody in the house can get online. High stakes for a home lab.
The Misleading Evidence
I SSH’d into the Infrastructure Pi and ran the obvious command:
systemctl status pihole-FTL
Active. Running. Green checkmark. No errors.
So I blamed the network. Maybe the upstream ISP was having issues? I checked from my laptop - nope, direct DNS queries to 8.8.8.8 worked fine. The problem was definitely my Pi.
I rebooted the Pi. DNS came back for about ten minutes. Then it died again, same symptoms: systemctl status showing active, but actual DNS queries timing out.
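A better check than systemctl is to query the resolver directly and bound the wait. A minimal sketch; the resolver address here is an assumption, substitute your Pi-hole's IP:

```shell
# Query the resolver itself instead of trusting unit status.
# timeout bounds the wait so a hung resolver fails fast
# rather than stalling the shell indefinitely.
timeout 3 dig @192.168.3.2 example.com +short \
    || echo "resolver not answering"
```

If this prints an address, DNS is actually working; if it prints the error line, the resolver is down regardless of what systemctl says.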
Finding the Real Problem
After the second reboot failed to stick, I stopped trying quick fixes and started actually diagnosing. The breakthrough came from checking the mount points:
df -h
The command hung. That’s when I knew something was fundamentally wrong with I/O.
I checked the process state:
ps aux | grep pihole-FTL
The process was in state D - uninterruptible sleep. It was blocked on I/O and could not be interrupted, not even by SIGKILL. The process was alive (so systemctl reported it as active) but completely stuck.
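D-state processes are easy to enumerate system-wide. A sketch using standard procps output options; the wchan column shows the kernel function each process is blocked in:

```shell
# List every process in uninterruptible sleep (state starts with D),
# keeping the header row, plus the kernel wait channel (wchan)
# it is blocked on.
ps -eo pid,stat,wchan:30,comm | awk 'NR==1 || $2 ~ /^D/'
```

On a healthy system this usually prints only the header; a hung NFS client shows processes parked in NFS-related wait channels.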
The culprit: /mnt/nfs_client. My NFS mount to the NAS was hung, and Pi-hole was configured to write logs directly to that mount.
Why Hard Mounts Are Dangerous
My /etc/fstab had a standard NFS mount:
192.168.3.5:/volume1/NFSS /mnt/nfs_client nfs defaults,_netdev 0 0
The default NFS behavior is a “hard” mount. When the NFS server becomes unreachable (NAS rebooted, network blip, whatever), the client waits forever for it to come back. Any process that tries to write to that mount blocks in uninterruptible sleep until NFS recovers.
I’d configured Pi-hole, Caddy, and my WireGuard logger to write directly to NFS paths because centralized logging seemed elegant. One place to grep all my logs. Simple.
Until the NAS goes offline for any reason, and suddenly your DNS server is catatonic.
The Architecture Fix
The solution wasn’t just changing the mount options - it was rethinking how logging should work.
Before (brittle):
Pi-hole → direct write to /mnt/nfs_client/pihole/queries.log
Caddy → direct write to /mnt/nfs_client/caddy/access.log
wg-logger → direct write to /mnt/nfs_client/wireguard/connections.log
After (resilient):
Pi-hole → local /var/log/pihole/pihole.log → rsyslog → NFS
Caddy → local /var/log/caddy/access.log → rsyslog → NFS
wg-logger → logger command → syslog → rsyslog → NFS
Services write locally (fast, always works). Rsyslog handles forwarding to NFS in the background. If NFS is unavailable, rsyslog buffers and retries while services continue running.
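The forwarding step can be sketched as an rsyslog snippet with a disk-assisted action queue. Paths, tags, and the config filename here are assumptions, not my exact config:

```
# /etc/rsyslog.d/60-nfs-central.conf  (sketch; paths are assumptions)
module(load="imfile")

# Tail the local Pi-hole log file into rsyslog.
input(type="imfile"
      File="/var/log/pihole/pihole.log"
      Tag="pihole:")

# Write to the NFS-backed file. The queue buffers messages and
# retries indefinitely while the NAS is unreachable, so local
# services never block on the mount.
action(type="omfile"
       file="/mnt/nfs_client/logs/central.log"
       queue.type="LinkedList"
       queue.filename="nfs_central_q"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")
```

The key design choice is the queue: without one, a blocked omfile action can stall rsyslog's main processing, which would just move the hang one layer down.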
The NFS Mount Change
I also changed the mount itself from hard to soft:
# /etc/fstab - before
192.168.3.5:/volume1/NFSS /mnt/nfs_client nfs defaults,_netdev 0 0
# /etc/fstab - after
192.168.3.5:/volume1/NFSS /mnt/nfs_client nfs soft,timeo=10,retrans=2,noauto,x-systemd.automount,_netdev 0 0
Key options:
soft - Return errors instead of waiting forever
timeo=10 - 1-second timeout (the value is in tenths of a second)
retrans=2 - Retry twice before failing (~2 seconds total)
noauto,x-systemd.automount - Mount on demand, not at boot
Now if NFS becomes unreachable, operations fail with an error in ~2 seconds instead of blocking forever. Services can handle errors; they can’t handle infinite waits.
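After editing /etc/fstab, the change has to be applied and verified. A sketch, assuming the mount path above (systemd derives the automount unit name from the path, with / becoming -):

```shell
# Pick up the fstab change and restart the generated automount unit.
sudo systemctl daemon-reload
sudo systemctl restart mnt-nfs_client.automount

# Confirm the live mount options actually include soft/timeo/retrans.
findmnt -T /mnt/nfs_client -o TARGET,SOURCE,OPTIONS
```

If findmnt still shows hard in the options, the old mount is lingering and needs an unmount first.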
Validation
After implementing the fix, I deliberately killed the NAS to test. The Infrastructure Pi stayed responsive. Pi-hole kept answering DNS queries. Rsyslog logged errors about NFS being unavailable, but services continued operating.
The centralized logs showed a gap during the NAS outage, then resumed normally when it came back. Exactly the behavior I wanted.
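The drill itself can be scripted. A hypothetical check, run while the NAS is powered off; the probe filename is an assumption:

```shell
# With the NAS down, a write to the soft mount should return an
# I/O error within a few seconds instead of hanging forever.
# timeout is a backstop in case the soft options did not apply.
time timeout 10 touch /mnt/nfs_client/.probe \
    && echo "NAS reachable" \
    || echo "failed fast, as intended"
```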
Lessons
1. systemctl status lies about hung processes. A service can be “active” while completely stuck. Check process state with ps aux when behavior doesn’t match status.
2. Hard NFS mounts have no place in critical paths. If a service needs to stay running regardless of network storage availability, it cannot directly depend on NFS.
3. Two-tier logging is worth the complexity. Local logs for reliability, centralized logs for convenience. The indirection through rsyslog adds maybe 2 seconds of latency but prevents service hangs.
4. Test your failure modes. I only discovered this problem because the NAS happened to become unreachable. Now I deliberately test NAS failure as part of verifying any infrastructure changes.
The real lesson: assumptions kill. I assumed NFS “just worked” because it had for months. I assumed centralized logging was purely beneficial. I assumed systemctl status told the truth. Three assumptions, one outage.
This post documents a real infrastructure incident from January 2026. The fix has been validated under deliberate failure conditions.
Written with Claude.