Autovacuum Killing Performance (I/O Spikes) —

Scenario

A production PostgreSQL cluster handling an e-commerce platform experiences periodic I/O spikes every 2–3 hours. During these spikes, query P95 latency rises from 5ms to 200ms for about 10 minutes. No new deployments correspond with the pattern. A DBA checks pg_stat_activity during a spike and sees several autovacuum worker processes actively vacuuming large tables.

How to Identify

Conditions:

Periodic I/O spikes correlating with autovacuum worker activity in pg_stat_activity
Query latency spikes during autovacuum runs visible in monitoring dashboards
pg_stat_bgwriter shows high buffers_clean activity during the same window
pg_stat_progress_vacuum shows autovacuum workers processing large tables
autovacuum_vacuum_cost_limit is set to a high value or default; autovacuum_vacuum_cost_delay is very low

Analysis Steps

-- 1. Identify autovacuum workers currently running and their targets
SELECT pid,
       left(query, 80)               AS query,
       now() - query_start           AS duration,
       wait_event_type,
       wait_event
FROM   pg_stat_activity
WHERE  backend_type = 'autovacuum worker'
ORDER  BY query_start;

-- 2. Monitor vacuum progress to see how much I/O is being done
SELECT p.relid::regclass             AS table_name,
       p.phase,
       p.heap_blks_total,
       p.heap_blks_scanned,
       p.heap_blks_vacuumed,
       round(p.heap_blks_vacuumed::numeric /
             nullif(p.heap_blks_total, 0) * 100, 1) AS pct_done,
       p.num_dead_tuples
FROM   pg_stat_progress_vacuum p;

-- 3. Check current cost throttling configuration
SHOW autovacuum_vacuum_cost_delay;
SHOW autovacuum_vacuum_cost_limit;
SHOW vacuum_cost_page_hit;
SHOW vacuum_cost_page_miss;
SHOW vacuum_cost_page_dirty;

-- 4. Check bgwriter I/O statistics (reset to baseline then re-check after spike)
SELECT buffers_clean,
       maxwritten_clean,
       buffers_backend,
       buffers_backend_fsync,
       buffers_alloc,
       now() - stats_reset AS stats_age
FROM   pg_stat_bgwriter;

-- 5. Identify tables that trigger large autovacuum runs
SELECT relname,
       n_dead_tup,
       last_autovacuum,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM   pg_stat_user_tables
WHERE  n_dead_tup > 10000
  OR   pg_relation_size(relid) > 1073741824    -- tables > 1 GB
ORDER  BY pg_relation_size(relid) DESC
LIMIT  20;

Pitfalls

Throttling autovacuum too aggressively (raising cost_delay, lowering cost_limit) reduces I/O during vacuum but allows dead tuples and bloat to accumulate. This leads to even larger, more disruptive vacuum runs later — or worse, transaction ID wraparound.
Disabling autovacuum on busy tables to avoid I/O spikes is dangerous. Dead tuples accumulate and bloat grows unchecked.
The real fix is reducing vacuum work per session, not eliminating vacuum. More frequent small vacuum runs are less impactful than infrequent large ones.
autovacuum_vacuum_cost_delay in PostgreSQL 14 defaults to 2ms. In older versions (PG13 and earlier) it defaulted to 20ms. If you upgraded and the default changed, autovacuum may now be much more aggressive.
Scheduling manual VACUUM during off-hours while keeping autovacuum enabled is a valid strategy, but autovacuum should still remain active to handle unexpected churn.

Resolution Approach

The goal is to spread vacuum I/O over time rather than allowing large bursts:

Autovacuum Killing Performance (I/O Spikes)

What This Is

Why It Matters

Next Step

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Career Impact

Related & next steps

More like this

Concepts on this page

Don't get paged twice for the same bug.

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Related Knowledge

Related Interview Questions

Career Impact

Related & next steps

lan More like this

menu_book Concepts on this page

Don't get paged twice for the same bug.

More like this

Concepts on this page