Cookbook recipe

Replication Lag Growing

Applies to PostgreSQL 13–17 Last reviewed May 2026 Grounded in source
Estimated investigation5 min

Scenario

Scenario An e-commerce platform runs a primary PostgreSQL server with two streaming standbys — one for read scaling and one for disaster recovery. A midnight batch job ingests 5 million product-price updates. By 01:00 the on-call…

Investigation Path

Scenario

An e-commerce platform runs a primary PostgreSQL server with two streaming standbys — one for read scaling and one for disaster recovery. A midnight batch job ingests 5 million product-price updates. By 01:00 the on-call engineer receives an alert: “Standby lag is 4 minutes and climbing.” Reads from the standby are returning stale prices. The primary itself looks healthy. The team needs to determine whether the lag is transient batch-related I/O saturation or a genuine replication problem that will not self-heal.

How to Identify

Conditions:

  • pg_stat_replication shows write_lag, flush_lag, and replay_lag columns with non-trivial values and growing over time
  • sent_lsn is ahead of replay_lsn on the standby by a widening gap
  • WAL sender process on primary is slow or blocked
  • bytes_lagging (computed from sent_lsn - replay_lsn) is large and increasing
  • Standby PostgreSQL logs contain messages about recovery conflicts or cancelled queries
  • Network throughput between primary and standby is saturated

Analysis Steps

-- 1. Check current replication state on the PRIMARY
SELECT
    client_addr,
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

-- 2. Compute bytes of WAL lag for each standby
SELECT
    client_addr,
    application_name,
    pg_wal_lsn_diff(sent_lsn, replay_lsn)  AS bytes_lagging,
    replay_lag
FROM pg_stat_replication;

-- 3. Monitor WAL sender activity (run repeatedly to observe trend)
SELECT pid, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  backend_type = 'walsender';

-- 4. Check for conflicts pausing replay on the STANDBY
SELECT datname,
       confl_tablespace,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
FROM   pg_stat_database_conflicts;

-- 5. Identify long-running queries on the standby that block replay
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM   pg_stat_activity
WHERE  state != 'idle'
ORDER  BY duration DESC
LIMIT  10;

Pitfalls

  • Transient vs. sustained lag: A few seconds of lag during a heavy write burst is normal. Only sustained and growing lag indicates a real problem. Do not panic-restart a healthy WAL sender.
  • Do not disconnect the replica unnecessarily. If the standby is still making progress (even slowly), disconnecting forces a full resync from a base backup when the slot or WAL segment expires.
  • Recovery conflicts on standby are a separate but related cause: a long-running query on the standby can cause WAL replay to pause (or cancel the query), which appears as lag. This is controlled by max_standby_streaming_delay and hot_standby_feedback.
  • Network saturation: During large batch writes the WAL stream itself can saturate the network link. In this case lag is real but self-corrects after the batch. Separate the WAL network interface from application traffic if this recurs.
  • synchronous_commit and lag: If the standby is synchronous, the primary will throttle commits waiting for the standby to acknowledge. This can cause primary slowdown in addition to standby lag.

Resolution Approach

Replication lag is a symptom; the root cause determines the fix. Work through the following in order:

This is a Pro lesson

Get every Learning Pathway and cookbook recipe — grounded in PostgreSQL source code, with diagnostics, fixes, and prevention for each topic.

Continue this lesson to learn:

  • Mitigation Actions
  • All 36 Learning Pathway lessons
  • 170+ cookbook recipes
  • Source-grounded diagnostics & fixes

Secure checkout Cancel anytime Source-grounded

Career Impact

This scenario builds production judgment and operational confidence under pressure.

Open Career Dashboard →

Keep going

Related & next steps

Was this helpful?

← All cookbook recipes