Cookbook recipe

WAL Disk Full

Applies to PostgreSQL 13–17. Last reviewed May 2026. Grounded in source.
Estimated investigation: 4 min

Investigation Path

Scenario

At 3 AM, all write operations on the database start hanging. The on-call engineer checks disk usage and finds the filesystem holding /var/lib/postgresql/14/main/pg_wal at 100%. PostgreSQL is still up (pg_stat_activity can be queried), but all INSERTs and UPDATEs are blocked with wait_event = 'WALWrite'. The pg_wal directory has grown to 50 GB even though wal_keep_size = 4096 (in MB, so 4 GB), and there is a stale replication slot with active = false.

How to Identify

Conditions:

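  • df -h shows the pg_wal filesystem at 100%
  • Write backends are stuck with wait_event = 'WALWrite' (if the filesystem fills completely, PostgreSQL may instead PANIC and shut down, logging "No space left on device")
  • pg_replication_slots has an inactive slot (active = false) whose restart_lsn is far behind the current WAL position
  • wal_keep_size is too large for the available disk
  • WAL archiving is failing (pg_stat_archiver.failed_count is increasing)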

Analysis Steps

-- Check which backends are stuck
SELECT pid, usename, state, wait_event_type, wait_event, query_start, query
FROM pg_stat_activity
WHERE wait_event = 'WALWrite'
ORDER BY query_start;

-- Check replication slots: an inactive slot holds WAL indefinitely
-- (unless max_slot_wal_keep_size is set)
SELECT slot_name, active, restart_lsn, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS held_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
-- Large held_wal + active = false: this slot is causing WAL accumulation

-- Check wal_keep_size
SHOW wal_keep_size;
-- If very large: PostgreSQL keeps all this WAL regardless of replication slots

-- Check WAL archiver
SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;
-- If failing: WAL is accumulating because archiving can't clear it

-- Check current WAL size
SELECT pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();
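
The checks above can be complemented by reading the WAL-retention settings in one pass. The query below is a minimal sketch that uses the standard pg_settings view; the list of setting names is a selection made for this example, not an exhaustive set.

```sql
-- Show the settings that influence how much WAL is retained.
-- pending_restart = true means a changed value is not yet in effect.
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('wal_keep_size',
               'max_slot_wal_keep_size',
               'max_wal_size',
               'archive_mode',
               'archive_command')
ORDER BY name;
```

A max_slot_wal_keep_size of -1 means slots can retain an unlimited amount of WAL, which is consistent with the scenario where pg_wal grows far beyond wal_keep_size.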

Pitfalls

  • Dropping an inactive replication slot stops it from holding WAL, but the segments are only removed or recycled at the next checkpoint. Verify first that no consumer (standby, logical subscriber, backup tool) still depends on the slot.
  • Reducing wal_keep_size alone does not reclaim space immediately; space is reclaimed at the next checkpoint.
  • Do NOT delete files from pg_wal/ manually. PostgreSQL manages this directory exclusively, and removing segments it still needs can corrupt the cluster or make it unrecoverable.
  • Keeping pg_wal on the same filesystem as the data directory (PGDATA) is a common architectural mistake: WAL growth should not starve the data directory.
  • Emergency fix: max_slot_wal_keep_size limits how much WAL a slot can hold. A slot that exceeds the limit is marked lost (wal_status = 'lost') and invalidated, so its consumer must be re-created. The default, -1, means no limit.
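
The two settings mentioned in the pitfalls can be changed without a restart, since both take effect on a configuration reload. The sketch below assumes superuser access; the values '10GB' and '1GB' are purely illustrative and must be sized against the actual disk and the lag your standbys can tolerate.

```sql
-- Illustrative values only; choose limits that fit your pg_wal filesystem.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
ALTER SYSTEM SET wal_keep_size = '1GB';

-- Both parameters are reloadable, so no restart is needed.
SELECT pg_reload_conf();

-- Watch how close each slot is to the limit.
-- wal_status: reserved | extended | unreserved | lost
-- safe_wal_size: bytes that can still be written before the slot is at risk
--                (NULL when no limit is set or the slot is already lost).
SELECT slot_name, active, wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots;
```

Slots that are already past the new limit are invalidated at the next checkpoint, so treat this as a deliberate decision to sacrifice the lagging consumer rather than a harmless tuning change.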

Resolution Approach

After confirming that nothing depends on it, drop the inactive replication slot that is causing the WAL bloat; the space is freed at the next checkpoint. Then address the root cause: reduce wal_keep_size, fix WAL archiving, and set max_slot_wal_keep_size to prevent recurrence. Long term, put pg_wal on a separate filesystem.
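
The immediate step can be sketched as follows. The slot name 'stale_slot' is hypothetical; substitute the slot identified during analysis. The sketch assumes a role allowed to drop slots and to run CHECKPOINT (superuser, or a member of pg_checkpoint on PostgreSQL 15 and later).

```sql
-- Drop the slot only if it is still inactive; pg_drop_replication_slot
-- raises an error for a slot that is in use.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'stale_slot'   -- hypothetical name
  AND NOT active;

-- Old segments are removed or recycled at checkpoint time,
-- so request one instead of waiting for the next scheduled checkpoint.
CHECKPOINT;

-- Confirm that pg_wal has shrunk.
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();
```

If pg_wal does not shrink after the checkpoint, something else is still retaining WAL, typically another slot, wal_keep_size, or a failing archive_command.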
