Cookbook recipe

Split-Brain After Network Partition

Applies to: PostgreSQL 13–17. Last reviewed: May 2026. Grounded in source.
Estimated investigation: 4 min

Investigation Path

Scenario

A 30-second network partition between the primary and standby causes the automatic failover system (Patroni/repmgr) to promote the standby. When the network recovers, both nodes are now accepting writes — the original primary continued accepting writes during the partition, and the newly promoted standby also accepted writes. This is a split-brain scenario. Data written to each node is now divergent.

How to Identify

Conditions:

  • Both servers return pg_is_in_recovery() = false
  • pg_current_wal_lsn() returns different values on the two nodes after the partition
  • WAL timeline on the new primary is higher than the old primary’s timeline
  • Application writes went to different nodes simultaneously
  • Fencing (STONITH) was not in place or failed
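The first three conditions can be checked mechanically once the status of each node has been collected. The following is a minimal sketch, assuming the values have already been fetched from both nodes; `NodeStatus`, `is_split_brain`, and `likely_new_primary` are hypothetical names introduced for illustration, and the sample LSNs are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeStatus:
    """Status values reported by a single node (hypothetical container)."""
    name: str
    in_recovery: bool   # result of pg_is_in_recovery()
    current_lsn: str    # result of pg_current_wal_lsn(), e.g. "0/5000140"
    timeline: int       # timeline_id from pg_control_checkpoint()


def is_split_brain(a: NodeStatus, b: NodeStatus) -> bool:
    """True when neither node is in recovery, i.e. both accept writes."""
    return not a.in_recovery and not b.in_recovery


def likely_new_primary(a: NodeStatus, b: NodeStatus) -> Optional[NodeStatus]:
    """Return the node on the higher timeline, or None if the timelines match."""
    if a.timeline == b.timeline:
        return None
    return a if a.timeline > b.timeline else b


# Made-up sample values for illustration only.
old = NodeStatus("node1", in_recovery=False, current_lsn="0/5000140", timeline=1)
new = NodeStatus("node2", in_recovery=False, current_lsn="0/50000D8", timeline=2)
print(is_split_brain(old, new), likely_new_primary(old, new).name)
```

The check deliberately ignores the LSN values: differing LSNs are expected on any pair of nodes under load, so the recovery flag and the timeline carry the decision here.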

Analysis Steps

-- On NODE 1 (original primary): check status
SELECT pg_is_in_recovery()      AS in_recovery,
       pg_current_wal_lsn()     AS current_lsn,
       timeline_id               AS timeline
FROM pg_control_checkpoint();

-- On NODE 2 (promoted standby): check status
SELECT pg_is_in_recovery()      AS in_recovery,
       pg_current_wal_lsn()     AS current_lsn,
       timeline_id               AS timeline
FROM pg_control_checkpoint();

-- If both return in_recovery=false: split-brain confirmed
-- The node with higher timeline_id is the "correct" new primary

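-- On original primary: check if any writes happened during the partition
-- (to find the actual transactions, look for XIDs above the last replicated XID).
-- This query only measures WAL written since the last checkpoint, which is a rough
-- indicator of recent write activity and not the exact amount since the promotion.
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), checkpoint_lsn) AS wal_bytes_since_checkpoint
FROM pg_control_checkpoint();

-- On new primary: check timeline history (shell command, not SQL)
-- cat $PGDATA/pg_wal/*.history   ← each line lists the parent timeline, the switch LSN, and the reason for the switch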
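To turn the timeline history into a divergence estimate, the switch LSN can be compared with the old primary's current LSN. The sketch below assumes the history file content has been read into a string and that each entry is a whitespace-separated line of parent timeline, switch LSN, and reason; the helper names and the sample values are hypothetical.

```python
from typing import List, Tuple


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/5000098' to a byte position in the WAL stream."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def parse_history(text: str) -> List[Tuple[int, str, str]]:
    """Parse history text into (parent_timeline, switch_lsn, reason) tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(None, 2)
        reason = fields[2] if len(fields) > 2 else ""
        entries.append((int(fields[0]), fields[1], reason))
    return entries


def diverged_bytes(old_primary_lsn: str, switch_lsn: str) -> int:
    """WAL bytes the old primary wrote past the point where the timelines split."""
    return lsn_to_int(old_primary_lsn) - lsn_to_int(switch_lsn)


# Made-up sample values, not taken from a real cluster.
SAMPLE_HISTORY = "1\t0/5000098\tno recovery target specified\n"
history = parse_history(SAMPLE_HISTORY)
parent_tli, switch_lsn, reason = history[-1]
print(parent_tli, switch_lsn, diverged_bytes("0/5000140", switch_lsn))
```

A positive result means the old primary kept writing after the standby was promoted; those bytes are the WAL range that has no counterpart on the new timeline.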

Pitfalls

  • Split-brain is caused by lack of fencing: the old primary should be immediately and forcibly shut down (STONITH — “Shoot The Other Node In The Head”) when a failover is triggered.
  • Automatic failover without fencing is dangerous in any HA system — not just PostgreSQL.
  • synchronous_commit = on with a synchronous standby (configured through synchronous_standby_names) prevents loss of acknowledged commits during failover, but doesn’t prevent split-brain if fencing fails.
  • Patroni uses distributed consensus (etcd/Consul/ZooKeeper) and DCS-based leader locks to prevent split-brain — this is why DCS is required.
  • After a split-brain, data reconciliation is a manual, complex process. Some writes from the old primary will be lost.
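The leader-lock idea behind the DCS pitfall can be illustrated with a toy lease. This is a simplified model under stated assumptions (a single lock key with a TTL, synchronized clocks, and a node that stops accepting writes once it cannot renew in time); it is not Patroni's implementation, and `LeaderLease` and `may_accept_writes` are hypothetical names.

```python
from typing import Optional


class LeaderLease:
    """Toy leader lock with a TTL, standing in for a key held in a DCS."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the lock if it is free, expired, or already held by node."""
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


def may_accept_writes(last_renewal: Optional[float], now: float, ttl: float) -> bool:
    """A node keeps the primary role only while its last successful renewal is fresh."""
    return last_renewal is not None and now - last_renewal < ttl


# node1 renews at t=10, then loses contact with the DCS.
lease = LeaderLease(ttl=30.0)
lease.try_acquire("node1", now=10.0)
print(lease.try_acquire("node2", now=20.0))    # lock still held by node1
print(may_accept_writes(10.0, 40.0, 30.0))     # node1 must step down at expiry
print(lease.try_acquire("node2", now=40.0))    # node2 may now take over
```

In this model the old primary gives up the role at the same instant the lock becomes available, so the two nodes never hold the primary role together. Real deployments still need fencing for the case where the old primary cannot or does not step down.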

Resolution Approach

Immediately shut down the old primary (forcibly if necessary). Identify the divergence point using WAL comparison. Accept that writes to the old primary after the divergence point are lost. Optionally: manually extract diverged writes from the old primary’s WAL using pg_waldump for reconciliation.
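For the optional extraction step, the pg_waldump call can be limited to the diverged range on the old primary's timeline. The sketch below only builds the command line and does not run it; the WAL directory, timeline, and LSN values are made-up placeholders, and `waldump_command` is a hypothetical helper.

```python
import shlex
from typing import List


def waldump_command(wal_dir: str, timeline: int, start_lsn: str, end_lsn: str) -> List[str]:
    """Build a pg_waldump invocation covering one LSN range on one timeline."""
    return [
        "pg_waldump",
        "-p", wal_dir,          # directory holding the old primary's WAL segments
        "-t", str(timeline),    # timeline the old primary stayed on
        "-s", start_lsn,        # divergence point (switch LSN from the history file)
        "-e", end_lsn,          # last LSN written by the old primary
    ]


# Placeholder values for illustration only.
cmd = waldump_command("/var/lib/postgresql/data/pg_wal", 1, "0/5000098", "0/5000140")
print(shlex.join(cmd))
```

Run the resulting command against a copy of the old primary's WAL after the node has been shut down, so that inspection cannot interfere with the surviving primary.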
