Scenario
A production PostgreSQL cluster handling an e-commerce platform experiences periodic I/O spikes every 2–3 hours. During these spikes, query P95 latency rises from 5ms to 200ms for about 10 minutes. No new deployments correspond with the pattern. A DBA checks pg_stat_activity during a spike and sees several autovacuum worker processes actively vacuuming large tables.
How to Identify
Conditions:
- Periodic I/O spikes correlating with autovacuum worker activity in
pg_stat_activity
- Query latency spikes during autovacuum runs visible in monitoring dashboards
pg_stat_bgwriter shows high buffers_clean activity during the same window
pg_stat_progress_vacuum shows autovacuum workers processing large tables
autovacuum_vacuum_cost_limit is set to a high value or default; autovacuum_vacuum_cost_delay is very low
Analysis Steps
-- 1. Identify autovacuum workers currently running and their targets
SELECT pid,
left(query, 80) AS query,
now() - query_start AS duration,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
ORDER BY query_start;
-- 2. Monitor vacuum progress to see how much I/O is being done
SELECT p.relid::regclass AS table_name,
p.phase,
p.heap_blks_total,
p.heap_blks_scanned,
p.heap_blks_vacuumed,
round(p.heap_blks_vacuumed::numeric /
nullif(p.heap_blks_total, 0) * 100, 1) AS pct_done,
p.num_dead_tuples
FROM pg_stat_progress_vacuum p;
-- 3. Check current cost throttling configuration
SHOW autovacuum_vacuum_cost_delay;
SHOW autovacuum_vacuum_cost_limit;
SHOW vacuum_cost_page_hit;
SHOW vacuum_cost_page_miss;
SHOW vacuum_cost_page_dirty;
-- 4. Check bgwriter I/O statistics (reset to baseline then re-check after spike)
SELECT buffers_clean,
maxwritten_clean,
buffers_backend,
buffers_backend_fsync,
buffers_alloc,
now() - stats_reset AS stats_age
FROM pg_stat_bgwriter;
-- 5. Identify tables that trigger large autovacuum runs
SELECT relname,
n_dead_tup,
last_autovacuum,
pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
OR pg_relation_size(relid) > 1073741824 -- tables > 1 GB
ORDER BY pg_relation_size(relid) DESC
LIMIT 20;
Pitfalls
- Throttling autovacuum too aggressively (raising cost_delay, lowering cost_limit) reduces I/O during vacuum but allows dead tuples and bloat to accumulate. This leads to even larger, more disruptive vacuum runs later — or worse, transaction ID wraparound.
- Disabling autovacuum on busy tables to avoid I/O spikes is dangerous. Dead tuples accumulate and bloat grows unchecked.
- The real fix is reducing vacuum work per session, not eliminating vacuum. More frequent small vacuum runs are less impactful than infrequent large ones.
autovacuum_vacuum_cost_delay in PostgreSQL 14 defaults to 2ms. In older versions (PG13 and earlier) it defaulted to 20ms. If you upgraded and the default changed, autovacuum may now be much more aggressive.
- Scheduling manual VACUUM during off-hours while keeping autovacuum enabled is a valid strategy, but autovacuum should still remain active to handle unexpected churn.
Resolution Approach
The goal is to spread vacuum I/O over time rather than allowing large bursts: