Scenario
After an unexpected server shutdown during peak hours, PostgreSQL takes 45 minutes to complete crash recovery and accept connections. During this time, the application is completely down. The PostgreSQL log shows recovery replaying 8 hours of WAL (max_wal_size = 8GB with heavy write load). The DBA realizes that max_wal_size = 8GB means up to 8 GB of WAL can be in-flight between checkpoints — and all of it must be replayed on crash.
How to Identify
Conditions:
- Long PostgreSQL startup time after crash
- PostgreSQL log shows
LOG: redo in progress at X, current LSN Y for a long time
max_wal_size is very large (more WAL between checkpoints = more to replay)
checkpoint_timeout is very long (infrequent checkpoints = more WAL to replay)
- Recovery is replaying from the last checkpoint, which is far in the past
Analysis Steps
-- Check checkpoint configuration
SHOW max_wal_size;
SHOW checkpoint_timeout;
SHOW checkpoint_completion_target;
-- Check pg_stat_bgwriter for checkpoint frequency
SELECT
checkpoints_timed,
checkpoints_req,
round(checkpoint_write_time / 1000.0 /
NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS avg_write_sec,
now() - stats_reset AS stats_age
FROM pg_stat_bgwriter;
-- Estimate WAL generated between checkpoints (WAL replay time on crash):
-- WAL replay speed ≈ 500 MB/s on modern hardware
-- 8 GB WAL / 500 MB/s = ~16 seconds (actual may be slower due to random I/O)
-- Check when last checkpoint happened:
SELECT checkpoint_lsn, checkpoint_time FROM pg_control_checkpoint();
-- If this is old (many minutes ago): recovery will be long
-- Check WAL generation rate to estimate checkpoint interval:
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), checkpoint_lsn)) AS wal_since_checkpoint
FROM pg_control_checkpoint();
Pitfalls
- Larger
max_wal_size = fewer checkpoints = more WAL to replay after a crash. This is the fundamental tradeoff: checkpoint overhead vs crash recovery time.
- Reducing
max_wal_size too aggressively increases checkpoint I/O frequency (see P050). Balance both.
checkpoint_timeout is a maximum interval — checkpoints happen MORE frequently if max_wal_size is consumed before the timeout expires.
- Moving to SSDs dramatically speeds up crash recovery — random I/O during replay is the bottleneck.
- WAL replay speed is bounded by the target storage’s random write IOPS.
Resolution Approach
Reduce max_wal_size to limit the recovery window (e.g., 1-2 GB for ≤ 5 minute recovery). Balance with checkpoint I/O overhead by adjusting checkpoint_completion_target. Move pg_wal to fast SSD if WAL replay speed is the bottleneck.