Scenario
A nightly pg_basebackup job started failing. On some nights it exits with FATAL: could not connect to server or ERROR: replication slot "backup_slot" does not exist; on others it fails with ERROR: requested WAL segment ... has already been removed. The team has no WAL archiving configured and relies solely on pg_basebackup for backups.
How to Identify
Conditions:
- pg_basebackup exits non-zero with connection or replication errors
- No WAL archiving: WAL needed during a long backup may be recycled, with no archived copy to fall back on
- max_wal_senders limit hit by concurrent standbys plus backup connections (pg_basebackup -X stream uses two)
- Backup replication slot referenced in the backup script was dropped manually
- wal_keep_size too small for long-running backups
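The error messages in the scenario map fairly directly onto these conditions. As a minimal sketch, a backup wrapper could triage a failed run by scanning pg_basebackup's stderr. classify_basebackup_error and its labels are hypothetical, and the matched substrings are assumptions based on the messages quoted above; check them against the exact wording your PostgreSQL version emits.

```python
def classify_basebackup_error(stderr: str) -> str:
    """Hypothetical triage helper: map pg_basebackup stderr to a coarse cause.

    The substrings are assumptions based on the messages in the scenario.
    """
    text = stderr.lower()
    if "replication slot" in text and "does not exist" in text:
        # The slot passed with -S was dropped (or never created).
        return "slot_missing"
    if "requested wal segment" in text and (
        "already been removed" in text or "already been recycled" in text
    ):
        # WAL the backup still needed was removed before it was streamed.
        return "wal_removed"
    if "max_wal_senders" in text or "could not connect" in text:
        # Server unreachable, or no free WAL sender slot.
        return "connection"
    return "unknown"


if __name__ == "__main__":
    sample = 'ERROR:  replication slot "backup_slot" does not exist'
    print(classify_basebackup_error(sample))  # slot_missing
```

Anything labelled "unknown" still needs a human look; the point is only to route the three recurring failures to the right check below.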
Analysis Steps
-- Check max_wal_senders vs current usage
SHOW max_wal_senders;
SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'walsender';
SELECT count(*) FROM pg_stat_replication;
-- Check if backup slot exists
SELECT slot_name, active, restart_lsn, wal_status
FROM pg_replication_slots
WHERE slot_name = 'backup_slot';
-- No row → the slot was dropped; wal_status = 'lost' → WAL the slot needed was removed (slot is unusable)
-- Check wal_keep_size
SHOW wal_keep_size;
-- '0' (the default) = no extra WAL retained; only slots, archiving and checkpoints keep segments (risky for long backups)
-- Check WAL archiving status
SHOW archive_mode;
SHOW archive_command;
-- If archive_mode=off and wal_keep_size=0: long backups can fail
-- if WAL from backup start is recycled before backup completes
-- Check pg_stat_archiver for archiving health
SELECT archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
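The checks above can be folded into a single pass over their results. This is a hypothetical sketch: the Diagnostics fields stand for the values returned by the queries above (with wal_keep_size already converted to MB by the caller), and the rule of requiring two free WAL senders assumes pg_basebackup runs with -X stream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Diagnostics:
    max_wal_senders: int            # SHOW max_wal_senders
    walsenders_in_use: int          # count(*) of walsender backends
    slot_wal_status: Optional[str]  # wal_status of backup_slot; None if no row
    wal_keep_size_mb: int           # SHOW wal_keep_size, converted to MB
    archive_mode: str               # SHOW archive_mode


def findings(d: Diagnostics, uses_slot: bool = True) -> List[str]:
    """Hypothetical helper: turn the query results into a list of problems."""
    issues = []
    # Assumption: -X stream, which needs two WAL sender connections.
    if d.max_wal_senders - d.walsenders_in_use < 2:
        issues.append("too few free WAL senders for a streaming backup")
    if uses_slot and d.slot_wal_status is None:
        issues.append("backup slot does not exist")
    elif uses_slot and d.slot_wal_status == "lost":
        issues.append("backup slot has lost its WAL; drop and recreate it")
    # Without a slot, archiving, or wal_keep_size, nothing holds WAL back.
    if d.archive_mode == "off" and d.wal_keep_size_mb == 0 and not uses_slot:
        issues.append("no slot, no archiving and wal_keep_size = 0: WAL can be removed mid-backup")
    return issues


if __name__ == "__main__":
    d = Diagnostics(10, 9, None, 0, "off")
    for issue in findings(d):
        print(issue)
```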
Pitfalls
- pg_basebackup streams WAL during the backup (the default -X stream mode). If WAL from the start of the backup is recycled before the backup finishes, the backup fails and cannot be used for a restore.
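- Using a replication slot with pg_basebackup -S slot_name prevents WAL recycling, but the slot must already exist unless -C (--create-slot) is also passed. If backups stop running and the slot is left in place, it keeps retaining WAL and pg_wal grows until the slot is dropped or max_slot_wal_keep_size invalidates it (its wal_status then becomes 'lost').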
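- max_wal_senders must cover standbys + logical replication consumers + pg_basebackup connections (two per backup with -X stream). Replication slots do not consume WAL senders; they are capped separately by max_replication_slots.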
- Without WAL archiving, point-in-time recovery (PITR) is impossible: a restore can only bring the cluster back to the state at the end of the backup.
- Never take pg_basebackup on a replica that has recovery_min_apply_delay set: the backup captures the replica's delayed replay position, so it can lag the primary's current state by the full delay.
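The max_wal_senders budget described in these pitfalls is simple arithmetic. A minimal sketch: counting one sender per logical consumer and the spare headroom parameter are assumptions for illustration, not PostgreSQL rules, and required_wal_senders is a hypothetical name.

```python
def required_wal_senders(standbys: int, logical_consumers: int,
                         concurrent_backups: int, stream_wal: bool = True,
                         spare: int = 1) -> int:
    """Hypothetical sizing helper for max_wal_senders.

    Each physical standby and each logical consumer is counted as one WAL
    sender (an assumption; logical table sync can briefly use more).
    pg_basebackup -X stream opens a second connection for WAL, so it needs two.
    """
    per_backup = 2 if stream_wal else 1
    return standbys + logical_consumers + concurrent_backups * per_backup + spare


if __name__ == "__main__":
    # Two standbys, no logical replication, one nightly streaming backup.
    print(required_wal_senders(2, 0, 1))  # 5
```

Compare the result with SHOW max_wal_senders; raising max_wal_senders takes effect only after a server restart.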
Resolution Approach
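Set wal_keep_size large enough to cover the WAL generated during the longest expected backup window. Use a dedicated replication slot for backups (or WAL archiving) to guarantee WAL availability, and drop the slot if the backup job is retired so it does not retain WAL indefinitely. Increase max_wal_senders to accommodate concurrent backups plus standbys. Enabling WAL archiving also makes PITR possible.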
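To size wal_keep_size, multiply the WAL generation rate by the longest backup duration and add headroom. A minimal sketch: the rate is an input you measure yourself (for example by sampling pg_current_wal_lsn() twice and taking pg_wal_lsn_diff() of the two values), while the 1.5x headroom and the 16 MB segment size are assumptions; check SHOW wal_segment_size on your cluster. wal_keep_size_mb is a hypothetical helper name.

```python
import math

WAL_SEGMENT_MB = 16  # assumption: default wal_segment_size


def wal_keep_size_mb(wal_mb_per_minute: float, longest_backup_minutes: float,
                     headroom: float = 1.5, segment_mb: int = WAL_SEGMENT_MB) -> int:
    """Hypothetical helper: WAL to retain for one backup window, in MB.

    Rounds up to a whole number of WAL segments.
    """
    needed = wal_mb_per_minute * longest_backup_minutes * headroom
    segments = math.ceil(needed / segment_mb)
    return segments * segment_mb


if __name__ == "__main__":
    # 10 MB/min of WAL, 60-minute backup, 1.5x headroom -> 900 MB -> 57 segments.
    print(wal_keep_size_mb(10, 60))  # 912
```

The result can be applied with ALTER SYSTEM SET wal_keep_size = '912MB'; followed by SELECT pg_reload_conf();, since wal_keep_size takes effect on reload. That much WAL stays in pg_wal at all times, not only while a backup runs, so budget the disk space accordingly.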