Scenario
The team performs monthly restore drills. This month, a restore from the nightly base backup fails: PostgreSQL starts but exits immediately with PANIC: could not locate a valid checkpoint record, or with a FATAL error reporting a database system identifier mismatch. The backup was taken with a file-level copy while the database was running, without proper backup mode handling.
How to Identify
Conditions:
- Restore test PostgreSQL exits immediately after start
- pg_controldata shows inconsistent state in the backup directory
- Backup was taken with rsync or a filesystem copy without putting the cluster into backup mode
- backup_label file is missing from the backup directory
- System identifier that pg_controldata reports for the backup differs from the source cluster's
Analysis Steps
-- After restore, check control file consistency (OS-level):
-- pg_controldata $PGDATA | grep -E "state|checkpoint|identifier"
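-- "Database cluster state: in production" is expected in an online backup only when backup_label is also present; without backup_label the copy is inconsistent
-- Check if backup_label exists (every valid online backup has one; check before the first start, because PostgreSQL renames it to backup_label.old once it has been read):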
-- ls -la $PGDATA/backup_label
-- Missing backup_label = backup was not taken correctly (filesystem copy without backup mode)
-- Check PostgreSQL logs for PANIC:
-- grep -i "PANIC\|FATAL\|checkpoint" $PGDATA/log/postgresql*.log | tail -20
-- Verify system identifier matches (in running system, compare with backup):
SELECT system_identifier FROM pg_control_system();
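-- If this doesn't match the pg_controldata output from the backup, the backup came from a different cluster or its control file is damaged
-- Check backup_manifest if available (pg_basebackup generates one by default in PG13+):
-- pg_verifybackup /backup/20240115/
-- This checks the backup's files (presence, size, checksum) against the manifest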
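The file-level checks above can be bundled into a small pre-start triage helper. This is a minimal sketch, assuming a plain-format (not tar) backup that has not yet been started; check_backup_markers is a hypothetical name, not a PostgreSQL tool, and it only looks at marker files, so it complements pg_controldata and pg_verifybackup rather than replacing them.

```shell
# Sketch: classify a backup directory by the marker files it contains.
# check_backup_markers is a hypothetical helper; run it BEFORE the first
# start, because PostgreSQL renames backup_label to backup_label.old.
check_backup_markers() {
    dir="$1"
    # Every PostgreSQL data directory contains a PG_VERSION file.
    if [ ! -f "$dir/PG_VERSION" ]; then
        echo "not-a-data-directory"
        return 2
    fi
    # No backup_label: the copy was made without backup mode handling.
    if [ ! -f "$dir/backup_label" ]; then
        echo "missing-backup_label"
        return 1
    fi
    # backup_manifest is written by pg_basebackup (PG13+), not by manual copies.
    if [ -f "$dir/backup_manifest" ]; then
        echo "ok-with-manifest"
    else
        echo "ok-no-manifest"
    fi
    return 0
}
```

For example, check_backup_markers /backup/20240115 printing missing-backup_label points at a copy made outside backup mode, while ok-with-manifest means pg_verifybackup can be run against the directory next.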
Pitfalls
- Copying a live PostgreSQL data directory with rsync without backup mode produces an inconsistent backup: individual files are from different points in time.
- pg_basebackup handles backup mode automatically. Manual rsync must call pg_backup_start() / pg_backup_stop() (PG15+, formerly pg_start_backup()/pg_stop_backup()) from the same database session, and must save the label text returned by pg_backup_stop() as backup_label inside the backup.
- The backup_label file signals to PostgreSQL that recovery is needed and where it must start. Without it, PostgreSQL tries to resume from the checkpoint recorded in pg_control, which is inconsistent with the copied data files.
- pg_verifybackup (PG13+) validates a backup against the manifest generated by pg_basebackup. Always run it after taking a backup.
- Encrypted or compressed backups that are corrupted mid-file can produce similar symptoms at restore time.
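For the manual rsync case, the key constraint is that pg_backup_start(), the copy, and pg_backup_stop() all happen while one session stays open. The sketch below prints a psql script that does this using psql's \! and \o meta-commands. It assumes PostgreSQL 15+, a label without quote characters, psql running on the host that holds both directories, and no extra tablespaces; render_lowlevel_backup_sql, the paths, and the exclude list are illustrative, and rsync's exit status is not checked here.

```shell
# Sketch: emit a psql script that keeps ONE session open across
# pg_backup_start(), the file copy, and pg_backup_stop() (PostgreSQL 15+).
# render_lowlevel_backup_sql and its arguments are hypothetical names.
render_lowlevel_backup_sql() {
    src="$1"; dest="$2"; label="$3"
    # Unquoted heredoc: "\\" becomes a single backslash for psql meta-commands.
    cat <<EOF
\\set ON_ERROR_STOP on
SELECT pg_backup_start('$label', true);
\\! rsync -a --exclude=postmaster.pid --exclude=postmaster.opts --exclude='pg_wal/*' '$src/' '$dest/'
\\t on
\\pset format unaligned
\\o $dest/backup_label
SELECT labelfile FROM pg_backup_stop(true);
\\o
EOF
}

# Usage (assumes connection settings come from PG* environment variables):
#   render_lowlevel_backup_sql "$PGDATA" /backup/manual nightly | psql -X -d postgres
```

The WAL generated between start and stop still has to be available at restore time (for example from the WAL archive), since pg_wal contents are excluded from the copy.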
Resolution Approach
Use pg_basebackup (not rsync) for online backups. Always run pg_verifybackup immediately after backup to catch corruption before it’s needed. For filesystem-level backups, use the low-level backup API (pg_backup_start/pg_backup_stop) to ensure consistency.
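One way to make the "backup, then verify" rule hard to skip is to wrap both steps in a single function that fails if either step fails. This is a minimal sketch, assuming PostgreSQL 13+ client tools on the PATH and connection settings supplied through the usual PG* environment variables; backup_and_verify and the destination path are hypothetical.

```shell
# Sketch: take a plain-format base backup and verify it immediately.
# backup_and_verify is a hypothetical wrapper, not a PostgreSQL tool.
backup_and_verify() {
    dest="$1"
    # Stream WAL alongside the data so the backup is self-contained.
    pg_basebackup -D "$dest" --format=plain --wal-method=stream --checkpoint=fast || {
        echo "pg_basebackup failed for $dest" >&2
        return 1
    }
    # Compare the files on disk with backup_manifest.
    pg_verifybackup "$dest" || {
        echo "pg_verifybackup failed for $dest" >&2
        return 2
    }
    echo "backup verified: $dest"
}

# Usage from a nightly job (path is illustrative):
#   backup_and_verify /backup/20240115 || exit 1
```

Distinct return codes (1 for the backup step, 2 for verification) let a scheduler or alerting rule tell a failed backup apart from a backup that completed but did not verify.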