Replication Slot WAL Bloat

Scenario

A production PostgreSQL primary has a dedicated 200 GB volume for pg_wal. A PagerDuty alert fires at 03:00: the volume is 91% full and growing at ~2 GB per hour. The DBA team checks df -h and confirms pg_wal is ballooning. The application is running normally, so write volume has not changed. Investigation reveals that a logical replication subscriber — a downstream analytics database — was quietly disconnected four days ago after a maintenance window that was never completed. The replication slot it used is still alive on the primary, preventing all WAL cleanup since the subscriber’s last LSN.
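
The scenario's figures imply roughly 18 GB of free space, or about nine hours before the volume fills, assuming growth stays at 2 GB per hour. A quick way to check that arithmetic in psql (the literals are simply the scenario's numbers, and the constant growth rate is an assumption):

```sql
-- Rough time-to-full estimate from the scenario's figures.
-- Assumes growth stays constant at ~2 GB/hour, which may not hold in practice.
SELECT
    200 * (1 - 0.91)       AS free_gb,           -- 200 GB volume, 91% used
    200 * (1 - 0.91) / 2.0 AS hours_until_full;  -- free space / hourly growth
```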

How to Identify

Conditions:

pg_replication_slots shows one or more slots with active = false
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) for the inactive slot returns a very large byte count (GBs or more)
The pg_wal directory is growing without bound
Disk usage on the WAL partition is climbing even though write volume is constant
No corresponding subscriber process is connected (active_pid IS NULL)

Analysis Steps

-- 1. List all replication slots and identify inactive ones
SELECT
    slot_name,
    plugin,
    slot_type,
    active,
    active_pid,
    restart_lsn,
    confirmed_flush_lsn,
    pg_current_wal_lsn()                                    AS current_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)     AS bytes_retained,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    )                                                        AS lag_pretty
FROM pg_replication_slots
ORDER BY bytes_retained DESC NULLS LAST;

-- 2. Check the pg_wal directory size (requires superuser or membership in pg_monitor)
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM   pg_ls_waldir();

-- 3. Check max_slot_wal_keep_size setting
SHOW max_slot_wal_keep_size;

-- 4. Check wal_keep_size (separate from slot retention)
SHOW wal_keep_size;

-- 5. Confirm subscriber is truly offline (no active connection)
SELECT slot_name, active, active_pid
FROM   pg_replication_slots
WHERE  active = false;
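
On PostgreSQL 13 and later, pg_replication_slots also exposes wal_status and safe_wal_size, which show how close a slot is to the max_slot_wal_keep_size limit. The query below is a supplementary sketch, not one of the original steps. It joins to pg_stat_replication so that a slot with no walsender attached shows NULL in the connection columns.

```sql
-- 6. (PostgreSQL 13+) Slot health and any attached walsender connection
SELECT
    s.slot_name,
    s.active,
    s.wal_status,                                      -- reserved | extended | unreserved | lost
    pg_size_pretty(s.safe_wal_size) AS safe_wal_size,  -- NULL when no limit is set or the slot is lost
    r.application_name,
    r.client_addr,
    r.state
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
ORDER BY s.slot_name;
```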

Pitfalls

Replication slots block removal of every WAL segment from the slot's restart_lsn onward. A single inactive slot can retain an unbounded amount of WAL regardless of wal_keep_size or archiving status, unless max_slot_wal_keep_size is set.
Disk full = PostgreSQL crashes. If the WAL partition fills completely, PostgreSQL panics and stops accepting writes. There is no graceful degradation.
Dropping the slot is safe for the primary but irreversible. Once dropped, the subscriber cannot resume from where it left off; it requires a full resync (a fresh pg_basebackup for a physical standby, or recreating the subscription with copy_data = true for a logical subscriber).
Do not delete .ready or .done files in pg_wal/archive_status to force WAL removal. Those files track archiving state, not slot retention, so deleting them does not release the slot and can leave gaps in the archive that break recovery.
max_slot_wal_keep_size was added in PostgreSQL 13. On older versions, there is no automatic limit on slot WAL retention; proactive monitoring is the only protection.
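
Where PostgreSQL 13 or later is available, a cap can be set so that an abandoned slot is invalidated instead of filling the disk. This is a minimal sketch, assuming 50GB is an acceptable ceiling for this volume. The value is illustrative, not a recommendation taken from the text above.

```sql
-- Cap the WAL that replication slots may retain (PostgreSQL 13+).
-- '50GB' is an assumed example value; the default of -1 means unlimited.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';

-- The setting takes effect on a configuration reload; no restart is needed.
SELECT pg_reload_conf();

-- Verify the active value.
SHOW max_slot_wal_keep_size;
```

The trade-off is that a slot falling further behind than the cap is invalidated (its wal_status becomes lost), and its subscriber then needs a full resync.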

Resolution Approach

An inactive replication slot is retaining WAL and filling the disk. Resolve in this order:
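
If the decision is to resync the analytics subscriber instead of reconnecting it, the slot removal mentioned under Pitfalls can be sketched as follows. The slot name analytics_slot is hypothetical, and this is one possible path under that assumption, not the full ordered procedure.

```sql
-- Confirm the slot is inactive before touching it
-- ('analytics_slot' is a hypothetical name).
SELECT slot_name, active, active_pid
FROM   pg_replication_slots
WHERE  slot_name = 'analytics_slot';

-- Drop the slot. This raises an error if a connection is still using it.
SELECT pg_drop_replication_slot('analytics_slot');

-- Old WAL segments are removed or recycled at checkpoint time, so request one
-- instead of waiting for the next scheduled checkpoint.
CHECKPOINT;

-- Check that the directory is shrinking.
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM   pg_ls_waldir();
```

If archive_mode is on, segments that have not yet been archived are still kept until the archiver catches up, so the directory may not shrink immediately.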
