Scenario
A production PostgreSQL primary has a dedicated 200 GB volume for pg_wal. A PagerDuty alert fires at 03:00: the volume is 91% full and growing at ~2 GB per hour. The DBA team checks df -h and confirms pg_wal is ballooning. The application is running normally, so write volume has not changed. Investigation reveals that a logical replication subscriber — a downstream analytics database — was quietly disconnected four days ago after a maintenance window that was never completed. The replication slot it used is still alive on the primary, preventing all WAL cleanup since the subscriber’s last LSN.
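As a rough check on urgency, the figures in the scenario imply about nine hours of headroom before the volume fills. This is a back-of-the-envelope sketch that assumes the ~2 GB per hour growth rate holds steady:

```sql
-- Remaining space: 200 GB * (1 - 0.91) = 18 GB.
-- At ~2 GB/hour of growth, 18 GB / 2 GB per hour is about 9 hours.
SELECT round(200 * (1 - 0.91) / 2.0, 1) AS hours_until_full;
```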
How to Identify
Conditions:
- pg_replication_slots shows one or more slots with active = false
- pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) for the inactive slot returns a very large byte count (gigabytes or more)
- The pg_wal directory is growing without bound
- Disk usage on the WAL partition is climbing even though write volume is constant
- No corresponding subscriber process is connected (active_pid IS NULL)
Analysis Steps
-- 1. List all replication slots and identify inactive ones
SELECT
    slot_name,
    plugin,
    slot_type,
    active,
    active_pid,
    restart_lsn,
    confirmed_flush_lsn,
    pg_current_wal_lsn() AS current_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS bytes_retained,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS lag_pretty
FROM pg_replication_slots
ORDER BY bytes_retained DESC NULLS LAST;
-- 2. Check the pg_wal directory size (requires superuser or pg_monitor role)
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM pg_ls_waldir();
-- 3. Check max_slot_wal_keep_size setting
SHOW max_slot_wal_keep_size;
-- 4. Check wal_keep_size (separate from slot retention)
SHOW wal_keep_size;
-- 5. Confirm subscriber is truly offline (no active connection)
SELECT slot_name, active, active_pid
FROM pg_replication_slots
WHERE active = false;
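On PostgreSQL 13 and later, pg_replication_slots also exposes wal_status and safe_wal_size, which show how each slot stands relative to max_slot_wal_keep_size. The query below is a minimal sketch of an additional check, assuming a server version of 13 or newer. These columns do not exist on older releases, so the query will fail there.

```sql
-- 6. (PostgreSQL 13+ only) Show how close each slot is to the max_slot_wal_keep_size limit.
-- wal_status:    reserved / extended / unreserved / lost
-- safe_wal_size: bytes that can still be written to WAL before the slot is at risk
--                of becoming "lost"; NULL when max_slot_wal_keep_size = -1 (no limit)
--                or when the slot is already lost
SELECT
    slot_name,
    active,
    wal_status,
    pg_size_pretty(safe_wal_size) AS safe_wal_size_pretty
FROM pg_replication_slots
ORDER BY safe_wal_size ASC NULLS LAST;
```

A NULL safe_wal_size together with SHOW max_slot_wal_keep_size returning -1 means that nothing limits how much WAL the slot can retain.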
Pitfalls
- A replication slot blocks removal of every WAL segment from its restart_lsn onward. A single inactive slot can retain an unbounded amount of WAL regardless of wal_keep_size or archiving status, unless max_slot_wal_keep_size is set.
- A full disk takes PostgreSQL down. If the WAL partition fills completely, PostgreSQL can no longer write WAL, raises a PANIC, and shuts down. There is no graceful degradation.
- Dropping the slot is safe for the primary but irreversible. Once it is dropped, the subscriber cannot resume from where it left off and needs a full resync: a fresh pg_basebackup for a physical standby, or recreating the subscription with copy_data = true for a logical subscriber.
- Do not delete .ready or .done files in pg_wal/archive_status to force WAL removal. This bypasses PostgreSQL's own archiving and slot accounting and can corrupt recovery.
- max_slot_wal_keep_size was added in PostgreSQL 13. On older versions, there is no automatic limit on slot WAL retention; proactive monitoring is the only protection.
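The slot drop and the retention cap mentioned in the pitfalls above might look like the following when run on the primary. This is a sketch under stated assumptions, not a prescribed procedure: the slot name analytics_sub and the 150GB limit are hypothetical placeholders, and the second statement applies only to PostgreSQL 13 or newer.

```sql
-- Hypothetical slot name; substitute the inactive slot found in the analysis steps.
-- pg_drop_replication_slot() raises an error if the slot is still active,
-- which guards against dropping a slot that is in use.
SELECT pg_drop_replication_slot('analytics_sub');

-- PostgreSQL 13+ only: cap how much WAL any single slot may retain.
-- 150GB is an illustrative value for a 200 GB volume, not a recommendation.
ALTER SYSTEM SET max_slot_wal_keep_size = '150GB';

-- The setting is picked up on a configuration reload; no restart is required.
SELECT pg_reload_conf();
```

WAL that is no longer needed is removed or recycled at checkpoint time, so disk usage may not fall until the next checkpoint completes. With a cap in place, a slot that falls too far behind is invalidated instead of filling the disk, and its subscriber then needs the same full resync as after a manual drop.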
Resolution Approach
An inactive replication slot is retaining WAL and filling the disk. Resolve in this order: