Scenario
An e-commerce platform runs a primary PostgreSQL server with two streaming standbys — one for read scaling and one for disaster recovery. A midnight batch job ingests 5 million product-price updates. By 01:00 the on-call engineer receives an alert: “Standby lag is 4 minutes and climbing.” Reads from the standby are returning stale prices. The primary itself looks healthy. The team needs to determine whether the lag is transient batch-related I/O saturation or a genuine replication problem that will not self-heal.
How to Identify
Conditions:
pg_stat_replication shows write_lag, flush_lag, and replay_lag columns with non-trivial values and growing over time
sent_lsn is ahead of replay_lsn on the standby by a widening gap
- WAL sender process on primary is slow or blocked
bytes_lagging (computed from sent_lsn - replay_lsn) is large and increasing
- Standby PostgreSQL logs contain messages about recovery conflicts or cancelled queries
- Network throughput between primary and standby is saturated
Analysis Steps
-- 1. Check current replication state on the PRIMARY
SELECT
client_addr,
application_name,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag,
sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
-- 2. Compute bytes of WAL lag for each standby
SELECT
client_addr,
application_name,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_lagging,
replay_lag
FROM pg_stat_replication;
-- 3. Monitor WAL sender activity (run repeatedly to observe trend)
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type = 'walsender';
-- 4. Check for conflicts pausing replay on the STANDBY
SELECT datname,
confl_tablespace,
confl_lock,
confl_snapshot,
confl_bufferpin,
confl_deadlock
FROM pg_stat_database_conflicts;
-- 5. Identify long-running queries on the standby that block replay
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 10;
Pitfalls
- Transient vs. sustained lag: A few seconds of lag during a heavy write burst is normal. Only sustained and growing lag indicates a real problem. Do not panic-restart a healthy WAL sender.
- Do not disconnect the replica unnecessarily. If the standby is still making progress (even slowly), disconnecting forces a full resync from a base backup when the slot or WAL segment expires.
- Recovery conflicts on standby are a separate but related cause: a long-running query on the standby can cause WAL replay to pause (or cancel the query), which appears as lag. This is controlled by
max_standby_streaming_delay and hot_standby_feedback.
- Network saturation: During large batch writes the WAL stream itself can saturate the network link. In this case lag is real but self-corrects after the batch. Separate the WAL network interface from application traffic if this recurs.
- synchronous_commit and lag: If the standby is synchronous, the primary will throttle commits waiting for the standby to acknowledge. This can cause primary slowdown in addition to standby lag.
Resolution Approach
Replication lag is a symptom; the root cause determines the fix. Work through the following in order: