MySQL performance monitoring covers the metrics that separate a stable database from one that silently degrades until something breaks. This article walks through the three areas that cause the most production problems – query performance, connection management, and replication health – with practical guidance on what to track, what thresholds to set, and which warning signs to catch before they become incidents.
A slow MySQL server rarely announces itself. More often it shows up as a slightly longer page load time on a Tuesday morning, a connection timeout that logs a single error, or a replication lag that stays just under the alert threshold for two days straight before finally blowing up. Knowing which metrics to watch – and why – makes the difference between proactive management and firefighting.
Why Query Monitoring Is Where Most DBA Work Lives
Query performance is the most visible layer of MySQL health. When queries slow down, users feel it immediately. The challenge is identifying which queries are responsible and understanding whether the issue is a bad execution plan, missing index, or resource saturation.
The slow query log is the first place to look. Enable it with a threshold of 1–2 seconds initially, then lower it to 200–500ms once you have a baseline. The `long_query_time` variable controls this, and it can be set at runtime without a restart.
Beyond the slow query log, the Performance Schema (enabled by default in MySQL 5.6+) gives per-query execution statistics including wait events, lock time, and rows examined. The `events_statements_summary_by_digest` table aggregates queries by fingerprint, making it easy to find the top offenders by total execution time rather than just single worst-case runs.
Key query metrics to monitor continuously:
Queries per second (QPS) – sudden drops or spikes are early warning signs of a bottleneck or runaway process.
Slow queries per minute – a rising count without a matching rise in QPS means individual query performance is degrading.
Rows examined vs. rows returned ratio – a ratio above 100:1 typically means a missing or unused index. Ratios above 1000:1 are a red flag.
Lock wait time – InnoDB row-level locking means lock contention usually signals either a long-running transaction or a poorly designed write pattern.
Setting up a performance baseline for QPS and average query latency is essential before configuring any alerts. Without a baseline, alert thresholds are guesses.
Connection Management – the Metric That Fills Up Quietly
MySQL has a hard limit on concurrent connections defined by `max_connections`. The default is often 151 on many distributions. Hit that ceiling and new connection attempts will fail immediately with “Too many connections” – one of the most disruptive errors a MySQL server can produce because it affects every application hitting that instance simultaneously.
The dangerous pattern isn’t a spike to 151 connections. It’s a gradual creep from 40 to 80 to 120 over several weeks, usually caused by a connection leak in application code or an ORM that isn’t closing connections properly. By the time the team notices, there’s no headroom left.
Metrics that matter for connection monitoring:
Threads_connected – current open connections. Alert when this exceeds 70–80% of `max_connections`.
Threads_running – connections actively executing queries right now, not just idle ones. A value persistently above 20–30 on a typical web app is worth investigating.
Connection_errors_max_connections – this counter increments every time a connection is rejected. Any non-zero value here means users are already hitting errors.
Aborted_connects and Aborted_clients – aborted connects often point to authentication or network issues; aborted clients mean the application disconnected without properly closing the connection.
Application-layer connection pooling significantly reduces pressure on MySQL’s connection limit, but it doesn’t eliminate the need to monitor these counters. Connection pool monitoring at the application level should sit alongside server-side connection metrics – they tell different parts of the same story.
Replication Lag – the Silent Data Consistency Risk
MySQL replication monitoring is where many teams have a false sense of security. The replica is running, `Slave_IO_Running` and `Slave_SQL_Running` both show “Yes,” so everything must be fine. Not necessarily.
Seconds_Behind_Master is the most-watched replication metric, but it’s also the most misunderstood. It measures how far behind the replica’s SQL thread is relative to the master’s binary log position – not actual wall-clock lag in data availability. It can also show 0 momentarily while the IO thread hasn’t caught up. Use it as a directional signal, not an absolute truth.
A more reliable approach is to use a heartbeat table – write a timestamp to a dedicated table on the master every 1–2 seconds, then measure how stale that value is on the replica. Tools like `pt-heartbeat` from Percona Toolkit implement this pattern and give more accurate lag measurements.
Metrics for replication health monitoring:
Seconds_Behind_Master – alert at 30 seconds for read-sensitive workloads, 5 minutes for more tolerant ones.
Relay_Log_Space – growing relay logs without a corresponding increase in Seconds_Behind_Master can indicate the SQL thread is struggling with a burst of writes.
Slave_SQL_Running_State – the state string shows what the SQL thread is currently doing. “System lock,” “Waiting for table metadata lock,” and “Reading event from the relay log” are all normal; seeing it stuck on the same state for extended periods is not.
Replication thread status – monitoring whether both IO and SQL threads are running should trigger an immediate alert if either stops.
For a broader view of what database metrics to collect alongside replication status, see database health metrics every DBA should monitor.
A Common Myth Worth Busting
The most common misconception about MySQL performance monitoring is that a low CPU usage percentage means the database is healthy. MySQL can be severely bottlenecked on I/O, lock contention, or network throughput while CPU stays under 20%. A server quietly exhausting its InnoDB buffer pool or hitting disk read saturation looks completely fine from a CPU graph alone.
Effective MySQL monitoring requires looking at InnoDB buffer pool hit rate (aim for 99%+), disk read/write IOPS, and query cache efficiency together – not just the headline system metrics. CPU is a lagging indicator at best.
Practical Alert Thresholds to Start With
For teams setting up MySQL monitoring from scratch, a starting point that works in most web application environments:
– Slow queries: alert if count exceeds 10 per minute sustained for 5 minutes.
– Threads_connected: warn at 75% of max_connections, critical at 90%.
– Replication lag: warn at 30 seconds, critical at 5 minutes.
– InnoDB buffer pool hit rate: alert if it drops below 95%.
– QPS drop: alert if QPS drops more than 40% from the 1-hour rolling average (possible crash or connection issue).
These thresholds need tuning once baseline data is collected for a specific environment. A batch-processing MySQL instance behaves very differently from an OLTP database serving 10,000 requests per minute.
Frequently Asked Questions
What is the most important MySQL metric to monitor?
There’s no single most important metric – but if forced to pick one starting point, Threads_running combined with slow query count gives the clearest picture of whether the database is actually serving requests efficiently right now. A high Threads_running with a rising slow query count points directly to a performance problem in progress.
How do I monitor MySQL replication without third-party tools?
Run `SHOW SLAVE STATUSG` on each replica and parse the output. The key fields are Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, and Last_Error. For automated monitoring, most infrastructure monitoring agents can execute this query on a schedule and alert on the returned values. The limitation is accuracy of lag measurement – for production environments, a heartbeat-based approach gives more reliable data.
Does enabling the Performance Schema slow down MySQL?
Performance Schema does add overhead, typically 5–10% in most production workloads. For databases under heavy load, selectively enabling only the instruments you need reduces that cost. In practice, the visibility gained far outweighs the overhead for most teams – running MySQL blind to wait events and query statistics costs more than the small performance tax.
Summary
MySQL performance monitoring comes down to three layers: catching slow and inefficient queries before they affect users, watching connection counts before the ceiling is reached, and verifying that replication is actually current – not just showing green status lights. Start with the slow query log and connection counters, establish baselines, then expand to replication lag and InnoDB internals. The teams that avoid MySQL crises aren’t the ones with the most complex monitoring setups – they’re the ones who got the basics wired up consistently and actually respond to the signals.
