Monthly Homelab Health Checklist — 12 Checks in 30 Minutes

Your homelab works. It's been working for months. Then one day Plex won't start because the disk is at 100% from log files you forgot existed, and your last backup was the day after you set up restic and never thought about it again.

Homelabs degrade quietly. Catastrophic failures get attention; slow degradation doesn't. The fix is a monthly check that takes about 30 minutes and catches the slow stuff before it becomes the catastrophic stuff. Here's the version that actually works in practice.

The checklist

Print it. Stick it on the wall by your homelab. Run through it the first Sunday of every month with coffee.

1. Disk space — every drive, not just `/`

df -h --total

Anything over 80% gets investigated this month. Anything over 90% gets fixed today. Common culprits: Docker images and volumes (docker system df), log files (du -sh /var/log/*), forgotten backups, old kernel images on Ubuntu (apt autoremove).
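
If you'd rather not eyeball the percentages, the thresholds above can be sketched as a small filter. This is a minimal sketch, not a definitive implementation: flag_full_filesystems is a hypothetical helper name, it assumes the POSIX `df -P` column layout, and it will misread mount points that contain spaces.

```shell
# Hypothetical helper: reads `df -P` output on stdin and labels each
# filesystem using the checklist's 80% / 90% thresholds.
flag_full_filesystems() {
  awk 'NR > 1 {
    use = $5
    sub(/%/, "", use)                      # Capacity column, percent sign removed
    if (use + 0 > 90)      print "FIX_TODAY " $6 " " use "%"
    else if (use + 0 > 80) print "INVESTIGATE " $6 " " use "%"
  }'
}

# Usage (the -x filters are GNU df options; drop them elsewhere):
#   df -P -x tmpfs -x devtmpfs | flag_full_filesystems
```

No output means every filesystem is at or under 80%.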

2. SMART status on every spinning disk

for d in /dev/sd?; do echo "== $d =="; sudo smartctl -H "$d"; done

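A FAILED health result = replace this month, not later. (The Pre-fail label in the TYPE column of smartctl -A is an attribute category, not a verdict.) Reallocated sectors > 0 = monitor closely.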
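
The reallocated-sector count is not part of the -H summary; it lives in the attribute table printed by `smartctl -A`. A minimal sketch for pulling it out, assuming the usual ten-column ATA attribute table (reallocated_sectors is a hypothetical helper name, and NVMe or SAS drives report health differently and will print nothing here):

```shell
# Hypothetical helper: prints the raw Reallocated_Sector_Ct value from
# `smartctl -A` output on stdin. RAW_VALUE is the tenth column of the
# ATA attribute table.
reallocated_sectors() {
  awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage:
#   sudo smartctl -A /dev/sda | reallocated_sectors
```

Write the number down each month; a count that keeps climbing is a stronger warning than a small value that stays put.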

3. Backups actually completed last cycle

Look at your backup tool's last run timestamp and exit code. What matters is whether it succeeded, not whether it ran. Restic, Borg, rsync, snapshots, whatever you use:

restic snapshots --latest 5
borg list ::

If the last snapshot is older than your declared backup interval, fix the cron / systemd-timer that broke. Probably your token expired or your remote ran out of space.
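
The "older than your declared backup interval" comparison is plain arithmetic on timestamps. A minimal sketch, assuming you already have the newest snapshot's time as a Unix epoch (snapshot_status is a hypothetical helper name; how you extract the timestamp depends on your backup tool):

```shell
# Hypothetical helper: snapshot_status LAST_EPOCH MAX_AGE_HOURS [NOW_EPOCH]
# Prints STALE and returns 1 when the newest snapshot is older than the
# allowed age; prints OK and returns 0 otherwise.
snapshot_status() {
  last=$1
  max_hours=$2
  now=${3:-$(date +%s)}                    # defaults to the current time
  if [ $(( now - last )) -gt $(( max_hours * 3600 )) ]; then
    echo STALE
    return 1
  fi
  echo OK
}

# Usage with GNU date, for a nightly backup with some slack:
#   snapshot_status "$(date -d '2024-05-01 03:00:00' +%s)" 26
```

Allowing a couple of hours beyond the nominal interval avoids false alarms when a run simply starts late.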

4. Backup restore actually works

Pick one file, restore it to a test directory, diff against the original. Quarterly is acceptable, monthly is better. The first time you actually need a backup is the worst time to discover the restore is broken.

restic restore latest --target /tmp/restore-test --include /etc/hostname
diff /etc/hostname /tmp/restore-test/etc/hostname

5. SSL certs not about to expire

for h in app.example.com api.example.com; do
  echo "== $h =="
  echo | openssl s_client -servername "$h" -connect "$h:443" 2>/dev/null | openssl x509 -noout -enddate
done

Anything within 14 days = trigger renewal manually + investigate why automatic renewal didn't fire.
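
To turn the notAfter line into a number you can compare against 14, something like the following may help. It is a sketch that assumes GNU date for parsing the OpenSSL date format; days_until_expiry is a hypothetical helper name.

```shell
# Hypothetical helper: days_until_expiry "notAfter=<date>" [NOW_EPOCH]
# Prints the whole number of days left before the certificate expires.
days_until_expiry() {
  end=${1#notAfter=}                        # strip the label openssl prints
  end_epoch=$(date -d "$end" +%s) || return 1
  now=${2:-$(date +%s)}
  echo $(( (end_epoch - now) / 86400 ))
}

# Usage:
#   line=$(echo | openssl s_client -servername app.example.com \
#            -connect app.example.com:443 2>/dev/null | openssl x509 -noout -enddate)
#   days_until_expiry "$line"
```

A negative result means the certificate has already expired.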

6. Container health

docker ps --filter health=unhealthy
docker ps --filter status=restarting

Both should be empty. If anything's restarting, check docker logs --tail 50 <container> for what it's complaining about.

7. Memory pressure

free -h
swapon --show

If swap is heavily used when it normally isn't, something is probably leaking. If oom_score_adj matters to you, this is when to set it.
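
"Heavily used" is easier to track month to month as a single number. A minimal sketch that reads the SwapTotal and SwapFree fields from /proc/meminfo on Linux (swap_used_percent is a hypothetical helper name, and what counts as "heavy" is for you to decide):

```shell
# Hypothetical helper: swap_used_percent [MEMINFO_FILE]
# Prints swap usage as a whole percentage; prints 0 when no swap is configured.
swap_used_percent() {
  awk '$1 == "SwapTotal:" { total = $2 }
       $1 == "SwapFree:"  { free = $2 }
       END {
         if (total == 0) { print 0; exit }
         printf "%d\n", (total - free) * 100 / total
       }' "${1:-/proc/meminfo}"
}

# Usage:
#   swap_used_percent
```

Noting the value each month gives you the baseline that "isn't normally" depends on.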

8. Security updates

sudo apt update && apt list --upgradable | grep -i security

Apply security updates this session. Leave non-security updates for when you have time to debug if something breaks.

9. Failed login attempts

sudo lastb | head -20
sudo grep "Failed password" /var/log/auth.log | tail -20

If the machine is internet-facing, expect noise. Concerning patterns: same IP repeatedly, valid usernames being tried, successful login from an IP you don't recognize.
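
Spotting "same IP repeatedly" is easier with a per-address tally than by reading raw log lines. A minimal sketch, assuming the standard OpenSSH "Failed password for ... from <address> port ..." message format (failed_logins_by_ip is a hypothetical helper name):

```shell
# Hypothetical helper: reads auth log lines on stdin and prints
# "<count> <address>" for each source of failed password attempts,
# highest count first.
failed_logins_by_ip() {
  awk '/Failed password/ {
         for (i = 1; i < NF; i++)
           if ($i == "from") { count[$(i + 1)]++; break }
       }
       END { for (ip in count) print count[ip], ip }' | sort -k1,1nr -k2,2
}

# Usage:
#   sudo cat /var/log/auth.log | failed_logins_by_ip | head -10
```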

10. Cron / systemd-timer history

sudo systemctl list-timers --all
journalctl --since '1 month ago' -u 'cron*' | grep -i fail

Anything that should run weekly but hasn't run in a month — investigate.

11. Open ports

sudo ss -tlnp

Compare against expected. New ports without your knowledge = something installed itself or got compromised.
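
"Compare against expected" works best when expected is written down. One possible approach, sketched under assumptions: keep a baseline file with one local address:port per line and print only what is not in it. new_listeners and the baseline path are hypothetical; -H is the iproute2 no-header flag for ss.

```shell
# Hypothetical helper: new_listeners BASELINE_FILE
# Reads `ss -tlnH` (or `ss -tlnpH`) output on stdin and prints listening
# sockets that do not appear in the baseline. The local address:port is
# the fourth column. grep exits 1 when nothing unexpected is found.
new_listeners() {
  baseline=$1
  awk '{ print $4 }' | sort -u | grep -vxF -f "$baseline"
}

# Usage (the path is an example, not a convention):
#   sudo ss -tlnH | new_listeners "$HOME/expected-ports.txt"
```

Create the baseline once from a state you trust, then update it deliberately whenever you add a service.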

12. Docker network sanity

docker network ls

If you see _old or unfamiliar networks, clean them up — they cause the kind of intermittent routing bug that takes hours to diagnose later.

What's out of scope

Three things that aren't on this checklist and shouldn't be:

Major version upgrades (Postgres, your distro, your hypervisor). Schedule those separately. Doing them as part of routine maintenance produces routine outages.
App-level config changes. The checklist is for infrastructure. Tweaking your Sonarr quality profile is a different activity.
Anything optional. Adding "clean up Plex thumbnail cache" sounds smart but you'll skip it the third month and feel bad about it. Keep the checklist minimum-viable.

When to upgrade the checklist

After a real incident, add the check that would have caught it. Example: "backup restore actually works" was added by someone who restored from a corrupted restic repo at 2am. "Docker network sanity" was added by someone who lost two hours to the case study earlier in this series.

Don't try to write a perfect checklist on day one. Add to it as your homelab teaches you what to add.

The goal is a 30-minute monthly habit that catches 80% of slow degradation. Not a 4-hour audit nobody actually runs.
