Scenario
At 3 AM, all write operations on the database start hanging. The on-call engineer checks disk usage and finds /var/lib/postgresql/14/main/pg_wal at 100%. PostgreSQL is still up (pg_stat_activity can be queried), but all INSERTs/UPDATEs are blocked with wait_event = 'WALWrite'. The pg_wal directory has grown to 50 GB, wal_keep_size = 4096 (in MB, i.e. 4 GB), and there is a stale replication slot with active = false.
How to Identify
Conditions:
- df -h shows the pg_wal filesystem at 100%
- All write backends are stuck in wait_event = 'WALWrite'
- pg_replication_slots has an inactive slot (active = false) with a low restart_lsn
- wal_keep_size is too large for the available disk
- WAL archiving is failing (pg_stat_archiver.failed_count is increasing)
Analysis Steps
-- Check which backends are stuck
SELECT pid, usename, state, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE wait_event = 'WALWrite'
ORDER BY query_start;
-- Check replication slots — inactive slots hold WAL indefinitely
SELECT slot_name, active, restart_lsn, wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS held_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
-- Large held_wal + active=false = this slot is causing WAL accumulation
-- Check wal_keep_size
SHOW wal_keep_size;
-- If very large: PostgreSQL keeps all this WAL regardless of replication slots
-- Check WAL archiver
SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;
-- If failing: WAL is accumulating because archiving can't clear it
-- Check current WAL size
SELECT pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();
Pitfalls
- Dropping an inactive replication slot releases the WAL it was holding, but the segments are only removed or recycled at the next checkpoint. Verify that no process depends on the slot first.
- Reducing wal_keep_size alone does not reclaim space immediately; space is reclaimed at the next checkpoint.
- Do NOT delete files from pg_wal/ manually. PostgreSQL manages this directory exclusively, and manual deletion can cause data corruption or leave the cluster unable to start.
- Keeping pg_wal on the same filesystem as the data directory is a common architectural mistake: WAL growth should not starve the data directory.
- Safeguard: max_slot_wal_keep_size limits how much WAL a slot can hold before it is marked lost and invalidated. The default of -1 means no limit.
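As a rough sketch of how to see how close each slot is to invalidation, the query below reads the wal_status and safe_wal_size columns of pg_replication_slots. It assumes PostgreSQL 13 or later, where those columns and max_slot_wal_keep_size exist. The alias wal_left_before_lost is made up for illustration.

```sql
-- Current cap on WAL retained by slots; -1 means unlimited.
SHOW max_slot_wal_keep_size;

-- wal_status is one of: reserved, extended, unreserved, lost.
-- safe_wal_size is the number of bytes that can still be written before
-- the slot risks being lost; it is NULL when no cap is set or the slot
-- is already lost.
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS wal_left_before_lost
FROM pg_replication_slots
ORDER BY safe_wal_size NULLS LAST;
```

A slot showing unreserved or lost is at or past the cap. A NULL in the last column next to reserved usually just means no cap is configured.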
Resolution Approach
First, confirm that nothing depends on the inactive replication slot causing the WAL bloat, then drop it; the held WAL is freed at the next checkpoint. Then address the root cause: reduce wal_keep_size, fix WAL archiving, and set max_slot_wal_keep_size to prevent recurrence. Long term: put pg_wal on a separate filesystem.
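The sequence above might look roughly like the following when run as a superuser in psql. This is a sketch under assumptions, not a definitive procedure: the slot name 'stale_slot' is a placeholder, and the 1GB and 10GB values are illustrative and should be sized to the actual disk and replication lag tolerance.

```sql
-- 1. List inactive slots and confirm which consumer each one served.
SELECT slot_name, slot_type, database, active, active_pid, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

-- 2. Drop the stale slot ('stale_slot' is a placeholder name).
SELECT pg_drop_replication_slot('stale_slot');

-- 3. Request a checkpoint so segments that are no longer needed
--    are removed or recycled.
CHECKPOINT;

-- 4. Lower retention and cap per-slot WAL (illustrative values).
--    Both settings take effect on reload; no restart is needed.
ALTER SYSTEM SET wal_keep_size = '1GB';
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

-- 5. Verify that pg_wal has shrunk.
SELECT pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();
```

ALTER SYSTEM writes postgresql.auto.conf in the data directory, so if pg_wal shares a full filesystem with it, step 4 may fail until steps 2 and 3 have freed space. If archiving is also failing, segments stay in pg_wal until the archive command succeeds, regardless of these settings.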