May 2, 2026
The Monthly Homelab Health Checklist: 12 Checks in 30 Minutes
Homelabs degrade quietly. A 30-minute monthly check covers backups, disks, certs, security updates, and the dozen other things that fail catastrophically when you stop watching them. Here's the checklist and the commands.

Your homelab works. It's been working for months. Then one day Plex won't start because the disk is at 100% from log files you forgot existed, and your last backup was the day after you set up restic and never thought about it again.
Homelabs degrade quietly. Catastrophic failures get attention; slow degradation doesn't. The fix is a monthly check that takes about 30 minutes and catches the slow stuff before it becomes the catastrophic stuff. Here's the version that actually works in practice.
The checklist
Print it. Stick it on the wall by your homelab. Run through it the first Sunday of every month with coffee.
1. Disk space — every drive, not just /
df -h --total
Anything over 80% gets investigated this month. Anything over 90% gets fixed today. Common culprits: Docker images and volumes (docker system df), log files (du -sh /var/log/*), forgotten backups, old kernel images on Ubuntu (apt autoremove).
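If you'd rather not eyeball the percentages, the two thresholds can be applied mechanically. This is a minimal sketch, not a finished tool: the function name and the labels are made up for illustration, and it assumes the POSIX `df -P` column layout (usage in column 5, mount point in column 6) and mount points without spaces.

```shell
# Flag filesystems above the checklist thresholds (80% and 90% by default).
# Reads `df -P` output on stdin. The function name and labels are illustrative.
flag_full_filesystems() {
    warn="${1:-80}"
    crit="${2:-90}"
    awk -v warn="$warn" -v crit="$crit" '
        NR == 1 { next }                  # skip the df header row
        {
            pct = $5
            sub(/%/, "", pct)             # "85%" -> "85"
            pct += 0                      # force numeric comparison
            if (pct > crit)      print "FIX-TODAY", $5, $6
            else if (pct > warn) print "INVESTIGATE", $5, $6
        }'
}

# Typical use on a live system (not executed here):
#   df -P | flag_full_filesystems
```

No output means every filesystem is at or under 80%.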
2. SMART status on every spinning disk
for d in /dev/sd?; do echo "== $d =="; sudo smartctl -H "$d"; done
A FAILED result = replace this month, not later. Reallocated sectors > 0 (the Reallocated_Sector_Ct raw value in smartctl -A, which -H does not print) = monitor closely.
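To pull that reallocated-sector count out of `smartctl -A` without reading the whole attribute table, something like the following works. It's a sketch that assumes the usual ATA attribute table, where the raw value is the tenth column. Drives that don't report the attribute (NVMe, for example) print nothing. The function name is illustrative.

```shell
# Print the raw Reallocated_Sector_Ct value from `smartctl -A` output on stdin.
# Prints nothing if the drive does not report that attribute.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Typical use (needs root and smartmontools; not executed here):
#   for d in /dev/sd?; do
#       printf '%s: %s\n' "$d" "$(sudo smartctl -A "$d" | reallocated_sectors)"
#   done
```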
3. Backups actually completed last cycle
Look at your backup tool's last run timestamp and exit code. Not whether it ran — whether it succeeded. Restic, Borg, rsync, snapshot, whatever you use:
restic snapshots --latest 5
borg list ::
If the last snapshot is older than your declared backup interval, fix the cron / systemd-timer that broke. Probably your token expired or your remote ran out of space.
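The "older than your declared interval" test is plain arithmetic once you have the newest snapshot's timestamp as Unix epoch seconds. Here's a rough sketch. The function name is hypothetical, and getting the timestamp out of your backup tool is left to you, since the output format differs per tool.

```shell
# Exit 0 (true) if the newest snapshot is older than the allowed interval.
#   $1  time of the newest snapshot, in Unix epoch seconds
#   $2  current time, in Unix epoch seconds
#   $3  maximum acceptable age, in hours
backup_is_stale() {
    age_seconds=$(( $2 - $1 ))
    [ "$age_seconds" -gt $(( $3 * 3600 )) ]
}

# Typical use (not executed here), with $last_epoch taken from your backup tool:
#   backup_is_stale "$last_epoch" "$(date +%s)" 26 && echo "backup overdue"
```

Giving a daily backup 26 hours rather than 24 leaves room for a run that starts a little late.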
4. Backup restore actually works
Pick one file, restore it to a test directory, diff against the original. Quarterly is acceptable, monthly is better. The first time you actually need a backup is the worst time to discover the restore is broken.
restic restore latest --target /tmp/restore-test --include /etc/hostname
diff /etc/hostname /tmp/restore-test/etc/hostname
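One file is the minimum. If you restore a handful, a small loop saves retyping the diff. This sketch assumes what the restic example above shows, that restored files land under the target directory at their original absolute path. The function name is illustrative. Pick files that rarely change, or a legitimate edit since the last backup will show up as a mismatch.

```shell
# Compare each original file with its restored copy under a restore root.
#   $1       restore target directory (e.g. /tmp/restore-test)
#   $2...    absolute paths of the original files
# Exits non-zero if any file differs or is missing from the restore.
verify_restore() {
    restore_root="$1"
    shift
    failures=0
    for original in "$@"; do
        if cmp -s "$original" "$restore_root$original"; then
            echo "OK   $original"
        else
            echo "FAIL $original"
            failures=$(( failures + 1 ))
        fi
    done
    [ "$failures" -eq 0 ]
}

# Typical use (not executed here):
#   verify_restore /tmp/restore-test /etc/hostname /etc/fstab
```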
5. SSL certs not about to expire
for h in app.example.com api.example.com; do
echo "== $h =="
echo | openssl s_client -servername $h -connect $h:443 2>/dev/null | openssl x509 -noout -enddate
done
Anything within 14 days = trigger renewal manually + investigate why automatic renewal didn't fire.
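To turn the `notAfter=` line into a number you can compare against 14, convert the date to epoch seconds and subtract. This is a sketch that assumes GNU `date` (the `-d` flag, standard on Linux but not on BSD or macOS). The function name is made up, and the current time is passed in as an argument so the calculation can be checked against a fixed clock.

```shell
# Convert an `openssl x509 -noout -enddate` line into whole days remaining.
#   $1  the line as printed by openssl, e.g. "notAfter=Jan 15 00:00:00 2030 GMT"
#   $2  current time in Unix epoch seconds
# Assumes GNU date for the -d option.
cert_days_left() {
    enddate="${1#notAfter=}"                       # strip the "notAfter=" prefix
    expiry_epoch=$(date -u -d "$enddate" +%s) || return 1
    echo $(( ($expiry_epoch - $2) / 86400 ))
}

# Typical use (not executed here):
#   line=$(echo | openssl s_client -servername "$h" -connect "$h:443" 2>/dev/null \
#          | openssl x509 -noout -enddate)
#   [ "$(cert_days_left "$line" "$(date +%s)")" -le 14 ] && echo "$h: renew now"
```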
6. Container health
docker ps --filter health=unhealthy
docker ps --filter status=restarting
Both should be empty. If anything's restarting, check docker logs --tail 50 <container> for what it's complaining about.
7. Memory pressure
free -h
swapon --show
If swap is heavily used and normally isn't, something is probably leaking memory. If you rely on oom_score_adj to keep the OOM killer away from critical services, this is when to set it.
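"Heavily used" is easier to track month to month as a number. This sketch computes swap usage as a percentage from the SwapTotal and SwapFree lines of /proc/meminfo. The function name is illustrative, and what counts as too high is your call, since the checklist doesn't name a threshold.

```shell
# Print swap usage as a whole percentage.
# Reads /proc/meminfo-style lines on stdin.
swap_used_pct() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free  = $2 }
        END {
            if (total == 0) { print 0; exit }    # no swap configured
            printf "%d\n", (total - free) * 100 / total
        }'
}

# Typical use (not executed here):
#   swap_used_pct < /proc/meminfo
```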
8. Security updates
sudo apt update && apt list --upgradable | grep -i security
Apply security updates this session. Leave non-security updates for when you have time to debug if something breaks.
9. Failed login attempts
sudo lastb | head -20
sudo grep "Failed password" /var/log/auth.log | tail -20
If your VPS is internet-facing, expect noise. Concerning patterns: same IP repeatedly, valid usernames being tried, successful login from an IP you don't recognize.
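The "same IP repeatedly" pattern is easier to spot as a tally than in a raw tail. This sketch assumes the usual sshd line shape, "Failed password for ... from IP port N ssh2", so the address is the fourth field from the end. The function name is made up for illustration.

```shell
# Count "Failed password" lines per source IP, most active first.
# Reads sshd log lines on stdin (the /var/log/auth.log format).
failed_logins_by_ip() {
    awk '/Failed password/ { print $(NF - 3) }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'      # normalise uniq -c padding to "count ip"
}

# Typical use (not executed here):
#   sudo cat /var/log/auth.log | failed_logins_by_ip | head -10
```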
10. Cron / systemd-timer history
sudo systemctl list-timers --all
sudo journalctl --since '1 month ago' -u 'cron*' | grep -i fail
Anything that should run weekly but hasn't run in a month — investigate.
11. Open ports
sudo ss -tlnp
Compare against expected. New ports without your knowledge = something installed itself or got compromised.
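"Compare against expected" goes faster if the expected list lives in a file. A minimal sketch, assuming a baseline file you maintain yourself with one address:port per line. The file path and function name are hypothetical. It reads `ss -tlnH` output, where `-H` drops the header row and the local address is the fourth column.

```shell
# Print listening sockets that are not in a saved baseline.
#   $1     baseline file, one "address:port" per line (maintained by you)
#   stdin  output of `ss -tlnH`
# Exits 0 if something new was found, 1 if everything matches the baseline.
new_listeners() {
    awk '{ print $4 }' | sort -u | grep -Fvx -f "$1"
}

# Typical use (not executed here; the baseline path is an example):
#   sudo ss -tlnH | new_listeners ~/homelab/expected-ports.txt
```

The first run doubles as setup: save the output of the awk and sort steps as your baseline once you've confirmed every entry belongs there.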
12. Docker network sanity
docker network ls
If you see _old or unfamiliar networks, clean them up — they cause the kind of intermittent routing bug that takes hours to diagnose later.
What's out of scope
Three things that aren't on this checklist and shouldn't be:
- Major version upgrades (Postgres, your distro, your hypervisor). Schedule those separately. Doing them as part of routine maintenance produces routine outages.
- App-level config changes. The checklist is for infrastructure. Tweaking your Sonarr quality profile is a different activity.
- Anything optional. Adding "clean up Plex thumbnail cache" sounds smart but you'll skip it the third month and feel bad about it. Keep the checklist minimum-viable.
When to update the checklist
After a real incident, add the check that would have caught it. Example: "backup restore actually works" was added by someone who restored from a corrupted restic repo at 2am. "Docker network sanity" was added by someone who lost two hours to the bug described in the case study earlier in this series.
Don't try to write a perfect checklist on day one. Add to it as your homelab teaches you what to add.
The goal is a 30-minute monthly habit that catches 80% of slow degradation. Not a 4-hour audit nobody actually runs.