You’re managing a dozen servers and everything’s running smoothly — until one morning a database stops accepting writes, deployments fail, and logs show “No space left on device.” Running out of disk space is one of the most preventable infrastructure failures, yet it catches teams off guard constantly. This article covers how to set up proper disk space monitoring, what thresholds actually make sense, and how to avoid the 3 AM panic that comes with a full disk.
Why Full Disks Cause More Damage Than You’d Expect
A full disk doesn’t just mean you can’t save files. It cascades. Databases crash because they can’t write transaction logs. Log rotation fails, which means your troubleshooting tools go blind at the exact moment you need them. Swap space fills up and the OOM killer starts terminating random processes. I’ve seen a single full /var partition take down an entire mail queue — thousands of messages stuck because Postfix couldn’t write to its spool directory.
The worst part? By the time you notice, the damage is already done. Recovery from a full disk often involves emergency cleanup while services are down, and that’s when rushed decisions lead to accidentally deleting something important.
The Myth of “I’ll Just Check It Manually”
Here’s a misconception that still persists: “I run df -h every day, so I’ll catch it in time.” No, you won’t. Disk usage doesn’t grow linearly. A server might sit at 60% for months, then a log explosion or an uncompressed backup pushes it to 95% in hours. Manual checks give you a snapshot — monitoring gives you a trend line and an early warning.
The other myth is that 90% is a safe alert threshold. It sounds reasonable, but on a 2 TB drive, 10% free is still 200 GB. On a 20 GB root partition, 10% free is 2 GB — which a busy application can burn through in minutes. Percentage-based thresholds alone are unreliable. You need to combine them with absolute free-space values and growth rate awareness.
Setting Up Disk Space Monitoring That Actually Works
Start with the basics. You need monitoring on every mount point, not just the root partition. A common blind spot is ignoring /tmp, /var/log, or dedicated data partitions. Each one can fill independently and cause different failures.
Here’s what a solid setup looks like:
Step 1: Identify all mount points. Run df -h and lsblk to see everything mounted. Don’t forget tmpfs or any network-attached storage.
Step 2: Set tiered alert thresholds. A two-level approach works well. Warning at 75–80% gives you time to plan. Critical at 90% means act now. But adjust these based on partition size — a 500 GB data volume at 80% is less urgent than a 10 GB root partition at 80%.
Step 3: Monitor inode usage too. A disk can show 50% space free but be completely out of inodes. This happens with applications that create millions of tiny files — session caches, mail queues, thumbnail generators. Check with df -i and make sure your monitoring covers this.
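The first three steps can be combined into one check script. What follows is a minimal sketch assuming GNU df on Linux; the threshold values and the alert_level helper are illustrative, not from any particular monitoring tool:

```shell
#!/bin/sh
# Classify one mount point from space-used %, absolute free GB, and
# inode-used %. Thresholds are illustrative: warn at 80% used; critical
# at 90% used, under 2 GB free, or 90% of inodes consumed.
alert_level() {
    used_pct=$1 free_gb=$2 inode_pct=$3
    if [ "$used_pct" -ge 90 ] || [ "$free_gb" -lt 2 ] || [ "$inode_pct" -ge 90 ]; then
        echo CRITICAL
    elif [ "$used_pct" -ge 80 ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Walk every mount point; -P keeps each filesystem on one line, -BG
# reports sizes in whole gigabytes.
df -P -BG | tail -n +2 | while read -r dev size used avail pct mount; do
    # Inode usage for the same mount (strip the % sign; some filesystems
    # report "-", which falls back to 0 below).
    inodes=$(df -iP "$mount" | awk 'NR==2 { gsub(/[^0-9]/, "", $5); print $5 }')
    echo "$mount $(alert_level "${pct%\%}" "${avail%G}" "${inodes:-0}")"
done
```

The absolute floor (2 GB here) is what catches the small-root-partition case that a pure percentage check misses; tune it per partition.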
Step 4: Deploy an agent for real-time tracking. With a tool like NetworkVigil, you install a lightweight agent that continuously reports disk metrics back to a central dashboard. No cron job scripting, no parsing output — just install and get visibility immediately.
What to Monitor Beyond Raw Disk Usage
Raw percentage used is table stakes. What separates useful monitoring from noise is trend data. If a partition grew 5% in the last 24 hours, you can predict when it’ll hit critical. If it’s been stable for weeks, a sudden 10% jump in an hour deserves immediate attention.
Track these alongside disk space:
Growth rate: How fast is usage increasing? A steady 1% per week is manageable. A sudden spike means something changed — a runaway log, a backup that didn’t get cleaned up, or an application dumping core files.
Largest directories: When you get an alert, you need to know where the space went. Periodic snapshots of directory sizes under /var, /tmp, and your application paths save you from running frantic du -sh commands during an incident.
Deleted-but-open files: This trips people up regularly. You delete a huge log file but the process holding it open still claims the space. Run lsof +L1 to find these. Disk usage won’t drop until the process releases the file handle.
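Both of those checks are easy to script ahead of time. Here’s a sketch, assuming GNU du and an installed lsof; the top_dirs helper is a name I’ve made up for illustration:

```shell
#!/bin/sh
# Print the N largest immediate subdirectories of a path, biggest first,
# so "where did the space go?" is already answered when an alert fires.
top_dirs() {
    path=$1 n=${2:-10}
    du -sk "$path"/*/ 2>/dev/null | sort -rn | head -n "$n" \
        | awk '{printf "%10.1f MB  %s\n", $1/1024, $2}'
}

# Deleted-but-open files: space df counts as used but ls can't see.
# `lsof +L1` lists open files whose on-disk link count is zero.
if command -v lsof >/dev/null; then
    lsof +L1 2>/dev/null \
        | awk 'NR > 1 { total += $7 }
               END { printf "deleted-but-open: %.1f MB\n", total / 1048576 }'
fi
```

Run top_dirs from cron against /var, /tmp, and your application paths, redirecting to a dated file, and the incident-time digging is already done.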
Having all this visible in a single multi-server dashboard means you’re not SSH-ing into boxes one at a time trying to figure out which server is actually the problem.
Automating Cleanup Before It’s an Emergency
Monitoring tells you there’s a problem. Automation prevents the problem from becoming an outage. Set up basic housekeeping:
Configure logrotate properly. Check that it’s compressing old logs and actually deleting them after a retention period. I’ve seen servers where logrotate was configured but failing silently — the logs just grew forever.
Automate temp file cleanup. Anything in /tmp older than 7 days is usually safe to remove. Cron jobs like find /tmp -type f -mtime +7 -delete are simple and effective.
Set up package cache cleanup. On Debian-based systems, apt-get autoclean removes cached package files that can no longer be downloaded. On hosts running Docker, docker system prune reclaims space from stopped containers, unused networks, and dangling images (add --volumes to include unused volumes). This alone can free up tens of gigabytes.
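The housekeeping above can be sketched as one script. This is a hedged sketch: the 7-day retention window, the paths, and the assumption of a Debian-style system with optional Docker are all things to adapt, and it defaults to a dry run that only prints what it would do:

```shell
#!/bin/sh
# Housekeeping sketch. Defaults to a dry run that only prints what it
# would do; pass --apply to actually run the commands.
APPLY=0
if [ "$1" = "--apply" ]; then APPLY=1; fi

run() {
    if [ "$APPLY" -eq 1 ]; then "$@"; else echo "would run: $*"; fi
}

# Temp files older than 7 days.
run find /tmp -type f -mtime +7 -delete

# Cached package files that can no longer be downloaded (Debian/Ubuntu).
if command -v apt-get >/dev/null; then run apt-get autoclean; fi

# Stopped containers, unused networks, dangling images.
if command -v docker >/dev/null; then run docker system prune -f; fi

# Catch logrotate failing silently: -d runs it in debug (no-op) mode.
if command -v logrotate >/dev/null; then run logrotate -d /etc/logrotate.conf; fi
```

The dry-run default is deliberate: review the output once on each server class before scheduling it with --apply.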
Pair these automations with real-time alerting so you know the moment automated cleanup isn’t keeping up anymore.
FAQ
How often should disk space be checked?
Agent-based monitoring checks every minute or so, which is ideal. If you’re relying on cron-based checks, every 5 minutes is about as infrequent as you should go. Hourly checks are too sparse — a lot can happen in 60 minutes on a busy server.
What’s the best alert threshold for disk space?
There’s no universal answer. A warning at 75–80% and critical at 90% is a common starting point, but always factor in absolute free space. A 90% alert on a 50 GB partition means 5 GB free — that might be fine for some workloads and dangerously low for others. Tune thresholds per partition based on how fast each one typically grows.
Can disk space monitoring help with capacity planning?
Absolutely. Historical disk usage trends show you exactly how fast storage consumption grows over weeks and months. This data lets you plan upgrades or storage expansions before you’re in an emergency procurement situation. It’s far cheaper to add a disk on a planned schedule than to rush an upgrade during an outage.
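A rough projection is easy to compute from usage samples. Here’s a sketch; the days_until_full helper is illustrative, and a real monitoring tool would fit a trend over many samples rather than two:

```shell
#!/bin/sh
# Project days until a partition fills, from two daily samples of GB
# used plus the partition size in GB. Prints "never" if usage is flat
# or shrinking.
days_until_full() {
    yesterday=$1 today=$2 size=$3
    awk -v y="$yesterday" -v t="$today" -v s="$size" 'BEGIN {
        growth = t - y                      # GB consumed per day
        if (growth <= 0) { print "never"; exit }
        printf "%.0f\n", (s - t) / growth   # days of headroom left
    }'
}
```

For example, a 100 GB volume that went from 58 GB used to 60 GB used in a day has roughly (100 − 60) / 2 = 20 days of headroom at that rate.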
Don’t Wait for the Alert You Should Have Had
Disk space monitoring is one of those things that feels unnecessary until the day it saves you. The setup effort is minimal — install an agent, configure your thresholds, and let the real-time metrics do the watching. The alternative is finding out about a full disk from an angry user or a failed deployment pipeline. Every server in your fleet should have disk monitoring from day one. It’s the cheapest insurance in infrastructure management.
