Replication Lag Growing —

Scenario

An e-commerce platform runs a primary PostgreSQL server with two streaming standbys — one for read scaling and one for disaster recovery. A midnight batch job ingests 5 million product-price updates. By 01:00 the on-call engineer receives an alert: “Standby lag is 4 minutes and climbing.” Reads from the standby are returning stale prices. The primary itself looks healthy. The team needs to determine whether the lag is transient batch-related I/O saturation or a genuine replication problem that will not self-heal.

How to Identify

Conditions:

pg_stat_replication shows write_lag, flush_lag, and replay_lag columns with non-trivial values and growing over time
sent_lsn is ahead of replay_lsn on the standby by a widening gap
WAL sender process on primary is slow or blocked
bytes_lagging (computed from sent_lsn - replay_lsn) is large and increasing
Standby PostgreSQL logs contain messages about recovery conflicts or cancelled queries
Network throughput between primary and standby is saturated

Analysis Steps

-- 1. Check current replication state on the PRIMARY
SELECT
    client_addr,
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

-- 2. Compute bytes of WAL lag for each standby
SELECT
    client_addr,
    application_name,
    pg_wal_lsn_diff(sent_lsn, replay_lsn)  AS bytes_lagging,
    replay_lag
FROM pg_stat_replication;

-- 3. Monitor WAL sender activity (run repeatedly to observe trend)
SELECT pid, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  backend_type = 'walsender';

-- 4. Check for conflicts pausing replay on the STANDBY
SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM   pg_stat_database_conflicts;

-- 5. Identify long-running queries on the standby that block replay
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM   pg_stat_activity
WHERE  state != 'idle'
ORDER  BY duration DESC
LIMIT  10;

Pitfalls

Transient vs. sustained lag: A few seconds of lag during a heavy write burst is normal. Only sustained and growing lag indicates a real problem. Do not panic-restart a healthy WAL sender.
Do not disconnect the replica unnecessarily. If the standby is still making progress (even slowly), disconnecting forces a full resync from a base backup when the slot or WAL segment expires.
Recovery conflicts on standby are a separate but related cause: a long-running query on the standby can cause WAL replay to pause (or cancel the query), which appears as lag. This is controlled by max_standby_streaming_delay and hot_standby_feedback.
Network saturation: During large batch writes the WAL stream itself can saturate the network link. In this case lag is real but self-corrects after the batch. Separate the WAL network interface from application traffic if this recurs.
synchronous_commit and lag: If the standby is synchronous, the primary will throttle commits waiting for the standby to acknowledge. This can cause primary slowdown in addition to standby lag.

Resolution Approach

Replication lag is a symptom; the root cause determines the fix. Work through the following in order:

Replication Lag Growing

What This Is

Why It Matters

Next Step

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Career Impact

Related & next steps

More like this

Concepts on this page

Don't get paged twice for the same bug.

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Related Knowledge

Related Interview Questions

Career Impact

Related & next steps

lan More like this

menu_book Concepts on this page

Don't get paged twice for the same bug.

More like this

Concepts on this page