WAL Archive Failure

Scenario

An alert fires at 3am: disk usage on the WAL partition is at 85% and rising. The DBA investigates and finds pg_stat_archiver.failed_count has been climbing for 6 hours. The archive destination ran out of space, causing every WAL segment to fail archiving. PostgreSQL retains all un-archived WAL in pg_wal/, which is now filling the disk.

How to Identify

Conditions:

pg_stat_archiver.failed_count increasing over time
pg_wal/archive_status/ accumulating .ready files while pg_wal/ keeps growing
archive_command returning non-zero exit code
PostgreSQL logs contain "archive command failed with exit code ..." messages
last_failed_wal and last_failed_time populated in pg_stat_archiver

Analysis Steps

-- 1. Check archiver state
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time,
    now() - last_archived_time AS time_since_last_success
FROM pg_stat_archiver;

-- 2. Check current archive settings
SHOW archive_command;
SHOW archive_mode;
SHOW wal_level;

-- 3. Count WAL segments waiting to be archived (from OS)
-- ls $PGDATA/pg_wal/archive_status/*.ready | wc -l

-- 4. Check pg_wal directory size
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM   pg_ls_waldir();
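
Step 3 above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: the function name count_ready is hypothetical, and it assumes the caller passes the path to pg_wal/archive_status and has read access to it (normally the postgres OS user). It uses find rather than a shell glob because a glob can fail with "argument list too long" once many thousands of .ready files have piled up.

```shell
#!/bin/sh
# Hypothetical helper: count WAL segments still waiting to be archived.
# Usage: count_ready "$PGDATA/pg_wal/archive_status"
count_ready() {
    status_dir=$1
    # One line per .ready marker; wc -l counts them.
    # tr strips the leading padding some wc implementations add.
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}
```

A count that keeps rising between two runs a few minutes apart suggests the archiver is still stuck; a falling count suggests it is draining the backlog.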

Pitfalls

Do not delete .ready files manually to “hide” the problem — PostgreSQL uses these to track what needs archiving. Deleting them means those WAL segments are never archived, breaking PITR capability silently.
The archiver processes WAL segments sequentially — it will not skip a failed segment to archive later ones. One stuck segment blocks all subsequent archiving.
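After fixing the archive destination, leave the status files alone: the archiver retries the failed segment on its own, roughly once a minute after repeated failures. PostgreSQL does not write .err files into archive_status/; that directory holds only .ready and .done markers.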
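Restarting PostgreSQL is NOT required to resume archiving after fixing archive_command. If the archive_command setting itself was edited, a configuration reload (SELECT pg_reload_conf();) is enough.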
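
Because the archiver works through segments in order, it helps to know which segment is at the head of the queue. The sketch below lists the oldest .ready marker by name. The function name next_ready_segment is hypothetical, and the sketch simplifies one detail: it sorts purely by file name and ignores the priority PostgreSQL gives to timeline history files, so treat the result as a hint and compare it with last_failed_wal in pg_stat_archiver.

```shell
#!/bin/sh
# Hypothetical helper: print the oldest WAL segment still marked .ready.
# Usage: next_ready_segment "$PGDATA/pg_wal/archive_status"
next_ready_segment() {
    status_dir=$1
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' \
        | sed -e 's|.*/||' -e 's|\.ready$||' \
        | LC_ALL=C sort \
        | head -n 1
    # sed strips the directory and the .ready suffix, leaving the segment name;
    # WAL segment names on one timeline sort in the order they were written.
}
```

If the printed segment matches last_failed_wal, that single segment is what holds back everything queued behind it.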

Resolution Approach

Fix the root cause (archive destination full, permission issue, network problem)

Test archive_command manually from the OS shell

Leave archive_status/ untouched; the archiver retries the failed segments automatically once archive_command succeeds

Monitor archived_count to confirm archiving resumes
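
Testing archive_command by hand is easier when the command follows two rules the PostgreSQL documentation recommends: return zero only on success, and refuse to overwrite a file that already exists in the archive. The following is a minimal sketch of such a copy step, assuming a plain local or mounted archive directory. The function name archive_one and the .tmp suffix are hypothetical choices, and the sketch does not fsync the copied file.

```shell
#!/bin/sh
# Hypothetical helper: copy one WAL segment into an archive directory.
# Usage: archive_one <path-to-segment> <archive-dir>
archive_one() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"

    # Never overwrite: an existing file may be a good copy from another
    # server or an earlier attempt, so report failure instead.
    if [ -e "$dest" ]; then
        echo "archive_one: $dest already exists" >&2
        return 1
    fi

    # Copy under a temporary name first so a half-written file (for example
    # when the disk fills up mid-copy) never appears under the final name.
    if ! cp "$src" "$dest.tmp"; then
        rm -f "$dest.tmp"
        return 1
    fi
    mv "$dest.tmp" "$dest"
}
```

Run as the postgres OS user against the segment named in last_failed_wal, a non-zero exit plus the message on stderr usually points straight at the root cause (disk full, permissions, unreachable mount).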

Mitigation Actions

-- 1. Identify the failing WAL and root cause
SELECT last_failed_wal, last_failed_time, failed_count FROM pg_stat_archiver;

-- 2. Test archive_command manually from OS shell:
-- cp $PGDATA/pg_wal/<last_failed_wal> /archive/destination/
-- If this fails → reveals the actual error (disk full, permission, network)

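-- 3. After fixing root cause, no cleanup in archive_status/ is needed:
-- PostgreSQL does not create .err files there, and the .ready files must stay.
-- The archiver retries the failed segment automatically, roughly once a minute
-- (archive_timeout only forces segment switches; it does not control retries)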

-- 4. Monitor recovery
SELECT archived_count, failed_count, last_archived_time
FROM   pg_stat_archiver;
-- archived_count should start increasing, failed_count should stop growing

-- 5. Alert configuration (add to monitoring):
-- Alert when: failed_count increases between checks
-- Alert when: now() - last_archived_time > 10 minutes

-- 6. Preventive: set archive destination monitoring
-- Alert when archive destination disk usage exceeds 70%
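
The two alert rules in step 5 can be sketched as a pure shell function that a monitoring job feeds with values read from pg_stat_archiver. Everything here is an assumption made for illustration: the function name archiver_alerts, the argument order, and the default limit of 600 seconds (the 10 minutes mentioned above). How the previous failed_count is stored between checks is left to the monitoring system.

```shell
#!/bin/sh
# Hypothetical helper: evaluate the two archiver alert rules.
# Usage: archiver_alerts <prev_failed_count> <cur_failed_count> \
#                        <seconds_since_last_archive> [max_lag_seconds]
# Prints one line per triggered alert; returns 1 if any alert fired, else 0.
archiver_alerts() {
    prev_failed=$1
    cur_failed=$2
    lag_seconds=$3
    max_lag_seconds=${4:-600}
    status=0

    # Rule 1: failed_count increased since the previous check.
    if [ "$cur_failed" -gt "$prev_failed" ]; then
        echo "ALERT: failed_count rose from $prev_failed to $cur_failed"
        status=1
    fi

    # Rule 2: too long since the last successful archive.
    if [ "$lag_seconds" -gt "$max_lag_seconds" ]; then
        echo "ALERT: no successful archive for ${lag_seconds}s (limit ${max_lag_seconds}s)"
        status=1
    fi

    return $status
}
```

On a quiet database the second rule can fire without any failure, because a segment is only archived once it is complete; setting archive_timeout or raising the limit avoids that false alarm.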
