Scenario
An alert fires at 3am: disk usage on the WAL partition is at 85% and rising. The DBA investigates and finds pg_stat_archiver.failed_count has been climbing for 6 hours. The archive destination ran out of space, causing every WAL segment to fail archiving. PostgreSQL retains all un-archived WAL in pg_wal/, which is now filling the disk.
How to Identify
Conditions:
- pg_stat_archiver.failed_count increasing over time
- pg_wal/ filling with WAL segments, with matching .ready files accumulating in pg_wal/archive_status/
- archive_command returning a non-zero exit code
- PostgreSQL logs contain messages such as "archive command failed with exit code 1"
- last_failed_wal and last_failed_time populated in pg_stat_archiver
Analysis Steps
-- 1. Check archiver state
SELECT
archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time,
now() - last_archived_time AS time_since_last_success
FROM pg_stat_archiver;
-- 2. Check current archive settings
SHOW archive_command;
SHOW archive_mode;
SHOW wal_level;
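-- 3. Count WAL segments waiting to be archived (from OS)
-- (find avoids the "argument list too long" error that ls with a glob hits on a large backlog)
-- find $PGDATA/pg_wal/archive_status -name '*.ready' | wc -l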
-- 4. Check pg_wal directory size
SELECT pg_size_pretty(sum(size)) AS wal_dir_size
FROM pg_ls_waldir();
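To size the backlog from the OS side, the count from step 3 can be turned into a rough figure in megabytes. The sketch below uses a hypothetical helper name, pending_wal_backlog, and assumes the default 16 MB wal_segment_size unless a second argument says otherwise. Small .history.ready or .backup.ready entries are counted as full segments, so the result is approximate.

```shell
# pending_wal_backlog STATUS_DIR [SEGMENT_MB]
# Prints how many segments are still marked .ready and an approximate size.
# SEGMENT_MB defaults to 16, the default wal_segment_size (an assumption here).
pending_wal_backlog() {
    status_dir="$1"
    segment_mb="${2:-16}"
    # Count .ready marker files; tr strips the padding some wc builds add.
    ready=$(find "$status_dir" -type f -name '*.ready' | wc -l | tr -d ' ')
    echo "$ready segments pending, about $((ready * segment_mb)) MB"
}
```

Typical use would be: pending_wal_backlog "$PGDATA/pg_wal/archive_status"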
Pitfalls
- Do not delete .ready files manually to “hide” the problem. PostgreSQL uses these to track what needs archiving. Deleting them means those WAL segments are never archived, breaking PITR capability silently.
- The archiver processes WAL segments sequentially — it will not skip a failed segment to archive later ones. One stuck segment blocks all subsequent archiving.
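- After fixing the archive destination, no cleanup in archive_status/ is needed: the archiver retries the failed segment on its own. Stock PostgreSQL creates only .ready and .done files there; any .err files would come from external tooling, and .ready files must be left in place.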
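- Restarting PostgreSQL is NOT required to resume archiving after fixing archive_command. If the setting itself was changed, a configuration reload (for example SELECT pg_reload_conf();) is enough.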
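Since the archiver works through segments in order, the segment blocking the queue is normally the oldest one still marked .ready, and it should match last_failed_wal. A minimal sketch for finding it from the shell, assuming a single timeline (so that file names sort in archive order) and a hypothetical function name:

```shell
# oldest_ready_segment STATUS_DIR
# Prints the name of the oldest WAL segment still marked .ready,
# or nothing if no segment is waiting.
oldest_ready_segment() {
    status_dir="$1"
    find "$status_dir" -type f -name '*.ready' \
        | sed -e 's|.*/||' -e 's|\.ready$||' \
        | LC_ALL=C sort \
        | head -n 1
}
```

If the name printed differs from last_failed_wal in pg_stat_archiver, that is worth investigating before assuming a single stuck segment.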
Resolution Approach
- Fix the root cause (archive destination full, permission issue, network problem)
- Test archive_command manually from the OS shell
- Let the archiver retry the failed segments; it does this automatically, and no status files need to be removed
- Monitor archived_count to confirm archiving resumes
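The manual test in the second step can be sketched as a small function that mimics a simple cp-based archive_command. This is an assumption about the setup: if the real archive_command uses another tool, run that tool instead. The function name and both arguments are hypothetical. Run it as the operating-system user that owns the PostgreSQL server, since that is the user archive_command runs as.

```shell
# try_archive_segment SRC_FILE DEST_DIR
# Copies one WAL segment into DEST_DIR, refusing to overwrite an existing
# file, and returns non-zero on any failure (as archive_command must).
try_archive_segment() {
    src="$1"
    dest_dir="$2"
    fname=$(basename "$src")
    if [ -e "$dest_dir/$fname" ]; then
        echo "refusing to overwrite $dest_dir/$fname" >&2
        return 1
    fi
    # cp's own exit status and error message reveal disk-full,
    # permission, or network-mount problems.
    cp "$src" "$dest_dir/$fname"
}
```

It is safer to point DEST_DIR at a scratch subdirectory on the same filesystem as the archive, or to remove the test copy afterwards: a leftover file can cause an archive_command that refuses to overwrite to fail for that segment.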
Mitigation Actions
-- 1. Identify the failing WAL and root cause
SELECT last_failed_wal, last_failed_time, failed_count FROM pg_stat_archiver;
-- 2. Test archive_command manually from OS shell:
-- cp $PGDATA/pg_wal/<last_failed_wal> /archive/destination/
-- If this fails → reveals the actual error (disk full, permission, network)
-- 3. After fixing the root cause, no status-file cleanup is needed:
-- the archiver retries the failed segment automatically, typically within about a minute.
-- archive_timeout only forces segment switches; it does not control retries.
-- 4. Monitor recovery
SELECT archived_count, failed_count, last_archived_time
FROM pg_stat_archiver;
-- archived_count should start increasing, failed_count should stop growing
-- 5. Alert configuration (add to monitoring):
-- Alert when: failed_count increases between checks
-- Alert when: now() - last_archived_time > 10 minutes
-- 6. Preventive: monitor the archive destination
-- Alert when archive destination disk usage reaches 70%
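The two alert rules in step 5 can be sketched as a small check that a monitoring job calls on each poll. The function name, argument order, and output strings below are illustrative assumptions; the 600-second default mirrors the 10-minute threshold above. The inputs would come from pg_stat_archiver: failed_count from the previous and current poll, and the age of last_archived_time in seconds.

```shell
# archive_alert_level PREV_FAILED CUR_FAILED LAG_SECS [MAX_LAG_SECS]
# Prints CRITICAL if failed_count grew since the last poll, WARNING if the
# last successful archive is older than MAX_LAG_SECS (default 600), else OK.
archive_alert_level() {
    prev_failed="$1"
    cur_failed="$2"
    lag_secs="$3"
    max_lag="${4:-600}"
    if [ "$cur_failed" -gt "$prev_failed" ]; then
        echo "CRITICAL: failed_count rose from $prev_failed to $cur_failed"
    elif [ "$lag_secs" -gt "$max_lag" ]; then
        echo "WARNING: no successful archive for ${lag_secs}s"
    else
        echo "OK"
    fi
}
```

On a mostly idle server the lag rule can fire without any fault, because no segment has been completed and so nothing needed archiving; pairing it with the failed_count rule or the .ready backlog count reduces false alarms.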