PostgreSQL interview prep

Train for the room and the pager. Real questions, the reasoning behind a strong answer, and live incident drills — not surface trivia.

Real interview and on-call questions — framed the way they're actually asked. Each one carries the model answer as a reasoning checklist, the signal the interviewer is testing, and the pitfall that ends interviews. Test yourself, then reveal.

0/73
My Career

Nothing matches these filters yet. Try a different tier, topic, or focus — or reset your progress.

Junior Constraints

What does a unique-violation (SQLSTATE 23505) actually mean, and how is it different from a primary-key violation?

What a strong answer covers
  • 23505 fires when an INSERT/UPDATE would create a duplicate in any UNIQUE constraint or unique index.
  • A primary key is just a UNIQUE + NOT NULL constraint, so PK collisions also surface as 23505.
  • The error detail names the constraint — read it to know which key collided.
  • Fixes: dedupe the input, use ON CONFLICT (upsert), or rethink the key if collisions are expected.
What they're really testing

Do you read the error detail instead of guessing which column collided?

Common pitfall

Saying "just catch and ignore it" without distinguishing a real duplicate from a retry.

Study this deeper Follow the related pathway

Junior DDL

Explain the difference between DELETE, TRUNCATE and DROP — and the operational consequences of each.

What a strong answer covers
  • DELETE removes rows, is transactional, fires triggers, leaves dead tuples for VACUUM.
  • TRUNCATE empties a table fast, is also transactional in PostgreSQL, but takes an ACCESS EXCLUSIVE lock and skips row triggers.
  • DROP removes the object entirely (schema + data).
  • TRUNCATE resets nothing you rely on unless you ask (RESTART IDENTITY for sequences).
What they're really testing

Awareness of locks, triggers and MVCC side effects — not just "they all remove data".

Common pitfall

Claiming TRUNCATE can't be rolled back (it can, inside a transaction).

Study this deeper Follow the related pathway

Junior Indexing

What is an index, and name a case where adding one will NOT help — or will hurt.

What a strong answer covers
  • An index is an ordered side structure that turns a scan into a lookup for selective predicates.
  • Won't help when the query returns most of the table (a seq scan is cheaper).
  • Hurts write throughput: every INSERT/UPDATE/DELETE maintains every index.
  • Low-cardinality columns, or functions/casts that don't match the index expression, defeat it.
What they're really testing

Understanding selectivity and the write-side cost, not "indexes make things fast".

Common pitfall

Indexing every column "to be safe".

Study this deeper Follow the related pathway

Junior Planner

Sequential scan vs index scan — is a sequential scan always a problem?

What a strong answer covers
  • A seq scan reads the whole table; an index scan walks the index then fetches rows.
  • For low-selectivity queries (returning a large fraction), a seq scan is the correct, cheaper plan.
  • The planner chooses based on cost estimates from table statistics.
  • Alarm only when a seq scan appears on a large table for a highly selective predicate — that signals a missing/unused index or stale stats.
What they're really testing

You don't treat "Seq Scan" in EXPLAIN as automatically bad.

Common pitfall

Forcing index usage with enable_seqscan=off in production.

Study this deeper Follow the related pathway

Junior Transactions

What is a transaction, and what exactly do COMMIT and ROLLBACK guarantee?

What a strong answer covers
  • A transaction groups statements into an all-or-nothing unit with ACID guarantees.
  • COMMIT makes changes durable and visible to other sessions.
  • ROLLBACK discards every change since BEGIN (or the last SAVEPOINT).
  • After an error mid-transaction, PostgreSQL aborts it — you must ROLLBACK before continuing (25P02).
What they're really testing

Knowing PostgreSQL aborts the whole transaction on error, unlike some other engines.

Common pitfall

Expecting statements after an error to keep working without rollback.

Study this deeper Follow the related pathway

Mid Performance

A query that was fast yesterday is slow today. Walk me through your diagnosis.

What a strong answer covers
  • Reproduce with EXPLAIN (ANALYZE, BUFFERS) and compare to a known-good plan.
  • Check for a plan flip: stale statistics → ANALYZE; check pg_stat_user_tables for last_analyze.
  • Look for data growth crossing a planner threshold, parameter sniffing, or bloat.
  • Check contention: locks, autovacuum activity, I/O saturation, cache hit ratio.
  • Confirm nothing changed around it: new data volume, a dropped index, a config change.
What they're really testing

A systematic method (evidence → hypothesis → confirm), not random knob-turning.

Common pitfall

Jumping straight to "add an index" before reading the plan.

Study this deeper Follow the related pathway

Mid Locks

You see "deadlock detected" (40P01) in the logs. What happened, and how do you prevent recurrence?

What a strong answer covers
  • Two transactions each hold a lock the other needs; the deadlock detector aborts one victim after deadlock_timeout.
  • Read the log detail — it names both processes and the locks involved.
  • Prevent by acquiring locks in a consistent order across the codebase.
  • Keep transactions short; add a sensible retry on 40P01 for the victim.
  • Reduce lock scope (narrower updates, right isolation level).
What they're really testing

Prevention via lock ordering + retry, not just "rerun it".

Common pitfall

Raising deadlock_timeout and calling it fixed.

Study this deeper Follow the related pathway

Mid Connections

Connections are maxing out ("too many connections", 53300). Immediate action and root cause?

What a strong answer covers
  • Immediate: find the offenders in pg_stat_activity (idle in transaction, long-running, app leaks).
  • Terminate safely if needed (pg_terminate_backend) and free headroom.
  • Root cause: missing/oversized app pool, no PgBouncer, connection leaks, long idle-in-transaction.
  • Fix the architecture: pool with PgBouncer (transaction mode), cap app pools, set idle_in_transaction_session_timeout.
What they're really testing

Treating max_connections as a symptom, not the cure.

Common pitfall

Just raising max_connections — each backend costs memory and worsens contention.

Study this deeper Follow the related pathway

Mid MVCC

Explain MVCC. Why do dead tuples accumulate, and what removes them?

What a strong answer covers
  • Each write creates a new row version; readers see a snapshot, so readers never block writers.
  • Old versions become "dead" once no snapshot can see them.
  • VACUUM reclaims dead tuples and updates the visibility map; autovacuum does it continuously.
  • Unremoved dead tuples cause bloat and can pin the xmin horizon (long transactions, replication slots).
What they're really testing

Connecting MVCC → dead tuples → VACUUM → bloat as one system.

Common pitfall

Thinking UPDATE edits a row in place.

Study this deeper Follow the related pathway

Mid Vacuum

What is autovacuum, why might it fall behind, and what are the symptoms?

What a strong answer covers
  • Background workers that VACUUM/ANALYZE tables crossing scale-factor thresholds.
  • Falls behind on high-churn tables, too-conservative settings, long transactions holding xmin, or too few workers.
  • Symptoms: growing table/index bloat, slow queries, rising dead-tuple counts, transaction-age climbing.
  • Tune per-table autovacuum_vacuum_scale_factor / cost limits; kill xmin-pinning transactions.
What they're really testing

Knowing autovacuum is tunable per-table and what pins its progress.

Common pitfall

Disabling autovacuum to "save resources".

Study this deeper Follow the related pathway

Mid Planner

EXPLAIN vs EXPLAIN ANALYZE — what do you read first in the output?

What a strong answer covers
  • EXPLAIN shows the estimated plan; ANALYZE actually runs it and reports real time and rows.
  • First compare estimated vs actual rows — a big gap means bad statistics or correlation the planner can't see.
  • Find the node consuming the most actual time, and check loops × per-loop cost.
  • Add BUFFERS to see cache vs disk; watch for spills (external sort/hash) and rows-removed-by-filter.
What they're really testing

Reading estimate-vs-actual divergence, not just the total time.

Common pitfall

Running EXPLAIN ANALYZE on a destructive statement without a transaction.

Study this deeper Follow the related pathway

Mid Locks

A session is stuck "idle in transaction". Why is that dangerous and how do you handle it?

What a strong answer covers
  • An open transaction holds its snapshot, pinning the xmin horizon so VACUUM can't clean newer dead tuples → bloat.
  • It may also hold locks, blocking DDL and other writers.
  • Find them in pg_stat_activity (state = idle in transaction, with age).
  • Mitigate with idle_in_transaction_session_timeout; fix the app to commit/rollback promptly.
What they're really testing

Linking idle-in-transaction to both bloat and lock blocking.

Common pitfall

Only killing the session and not setting the timeout to prevent recurrence.

Study this deeper Follow the related pathway

Mid Isolation

Describe the isolation levels PostgreSQL supports and the anomalies each prevents.

What a strong answer covers
  • PostgreSQL offers Read Committed (default), Repeatable Read, and Serializable (Read Uncommitted behaves as Read Committed).
  • Read Committed: each statement sees a fresh snapshot; prevents dirty reads.
  • Repeatable Read: one snapshot per transaction; prevents non-repeatable and phantom reads, can fail with 40001.
  • Serializable (SSI): guarantees a serial order; may abort with serialization_failure — design for retry.
What they're really testing

Knowing higher levels trade throughput for correctness and require retry logic.

Common pitfall

Assuming Serializable means "locks everything" rather than optimistic SSI with retries.

Study this deeper Follow the related pathway

Senior Migrations

Design a zero-downtime way to add a NOT NULL column with a default to a 1 TB table.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Vacuum

Transaction-ID wraparound is approaching. Explain the mechanism, the risk, and your remediation.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Replication

Replica lag is growing under write load. Diagnose the causes and the levers you'd pull.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Architecture

Partition a 5 TB time-series table, or just buy a bigger box? How do you decide?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Planner

How does the planner estimate row counts, and what makes estimates go wrong?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Indexing

A B-tree index is bloated and reads are slow. REINDEX vs pg_repack — and the online options?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Senior Connections

When do you need PgBouncer, and what are the transaction- vs session-pooling tradeoffs?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Staff Architecture

Design the HA and failover architecture for a payments database with an RPO of zero.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Staff Operations

Define your backup, PITR and disaster-recovery strategy — and how you'd test it.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Staff Operations

Roll out a major-version upgrade across a fleet with minimal risk and downtime.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Staff Operations

Set capacity and autovacuum policy for a multi-tenant SaaS with noisy neighbours.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Staff Incident

The primary is up but p99 latency jumped 10x with no deploy. Walk me through your response.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Incident

You are paged for "What “Managed PostgreSQL” Actually Means: The Shared Responsibility Model". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Incident

An incident starts from "Two Architectures: “Postgres-on-a-Disk” vs Cloud-Native Disaggregated Storage". What is your first 10-minute plan, then your 1-hour stabilization plan?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Incident

For "AWS RDS for PostgreSQL: The Managed Baseline", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You inherit "AWS Aurora PostgreSQL: The Log-Is-the-Database Storage Engine" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You are paged for "Google Cloud SQL for PostgreSQL: Google’s Managed Baseline". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Migrations

An incident starts from "Google AlloyDB: Disaggregated Storage and a Columnar HTAP Engine". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

For "Azure Database for PostgreSQL Flexible Server: Community Postgres on Azure", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You inherit "Azure HorizonDB: Microsoft’s Cloud-Native Distributed PostgreSQL" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You are paged for "Neon: Serverless PostgreSQL, Scale-to-Zero, and Copy-on-Write Branching". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

An incident starts from "Databricks Lakebase: Lakehouse-Native PostgreSQL OLTP". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Replication

For "High Availability & Failover Models Compared", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Connections

You inherit "Read Scaling: Read Replicas vs Shared-Storage Replicas vs Read Pools" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Replication

You are paged for "Backups, PITR, and the Durability You Actually Get". Walk me through triage, containment, root cause proof, and prevention.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Connections

An incident starts from "Connections and Pooling: Why a Pooler Is Almost Mandatory". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Locks

For "What You Give Up: No Superuser, Extension Allowlists, Locked GUCs, and Lock-In", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You inherit "Choosing a Managed PostgreSQL: A Decision Framework and Cost Model" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

You are paged for "Cross-Region Disaster Recovery: RPO, RTO, and Async Replication Over the WAN". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Replication

An incident starts from "Failover Slots: Keeping Logical Replication Alive Across a Failover (PostgreSQL 17)". What is your first 10-minute plan, then your 1-hour stabilization plan?

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Replication

For "Cascading Replication and Delayed Standbys: Shaping the Replica Topology", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

You inherit "pgBackRest: How Incremental Backups and Point-in-Time Recovery Actually Work" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You are paged for "Cluster Managers: How Patroni and etcd Decide Who Is Primary". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

An incident starts from "The Background Processes: Who Actually Runs a PostgreSQL Server". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

For "Memory Contexts: Why PostgreSQL Almost Never Calls free()", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You inherit "How PostgreSQL Remembers Who Committed: clog and the SLRU Caches" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You are paged for "How a B-tree Splits Without Stopping the World: Lehman-Yao in PostgreSQL". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

An incident starts from "One Tuple at a Time: How the Volcano Executor Runs Your Plan". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

For "Anatomy of a WAL Record: What PostgreSQL Writes Before It Touches Your Data", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Migrations

You inherit "Tuple Deforming: Why Column Order Quietly Changes Your Row Size" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Locks

You are paged for "The Lock Fast-Path: How PostgreSQL Avoids a Traffic Jam at the Lock Table". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

An incident starts from "Paths vs Plans: How the Optimizer Chooses, Then Commits". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

For "From SQL Text to Result: The Parse, Plan, Execute Pipeline", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You inherit "Heap Pruning and the xmin Horizon: How a Plain SELECT Cleans Up" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

You are paged for "ANALYZE Internals: How PostgreSQL Samples Your Data to Guess Plans". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

An incident starts from "PERMISSION DENIED — Diagnosing SQLSTATE 42501 insufficient_privilege". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Connections

For "Mid-Session Connection Failure — SQLSTATE 08006 (pgBouncer / NAT / TCP Keepalive)", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Connections

You inherit "Connection Routing for HA: VIPs, Pooler, and Read Scaling" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Replication

You are paged for "Automatic Failover, Fencing, and Split-Brain". Walk me through triage, containment, root cause proof, and prevention.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Incident

An incident starts from "pg_rewind: Reusing a Diverged Old Primary". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

For "WAL Archiving and Point-in-Time Recovery", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Senior Replication

You inherit "Synchronous Replication and Quorum Commit" during on-call. Explain the safest path from symptom to durable fix.

Pro — Senior & Staff model answers

The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.

Unlock with Pro
Mid Replication

You are paged for "Hot Standby and Recovery Conflicts". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

An incident starts from "Replication Slots: Guaranteeing WAL Retention". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

For "Logical Replication and Logical Decoding", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

You inherit "Streaming Replication: walsender and walreceiver" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Vacuum

You are paged for "Online VACUUM, REINDEX, and Table Rewrites". Walk me through triage, containment, root cause proof, and prevention.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Replication

An incident starts from "Blue/Green and Logical Replication Upgrades". What is your first 10-minute plan, then your 1-hour stabilization plan?

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Incident

For "Batched Backfills: Updating Millions of Rows Safely", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Mid Migrations

You inherit "Expand/Contract: The Pattern Behind Every Safe Migration" during on-call. Explain the safest path from symptom to durable fix.

What a strong answer covers
  • State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
  • Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
  • Choose the least-risk fix path and explain rollback if the fix does not hold.
  • Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
What they're really testing

Can you run a safe incident loop under pressure and justify each decision with evidence?

Common pitfall

Jumping to a permanent fix before stabilizing user impact and capturing proof.

Study this deeper Open source scenario Follow the related pathway

Simulator mode

Run the on-call loop

Practise structured response on cookbook scenarios: triage, contain, validate, prevent. Grade each run and reshuffle for repetition.

Grade your scenario runs

Full Page Writes Inflating WAL Size

A DBA notices WAL generation is 800 MB/min despite an average write rate of only 150 MB/min of actual data. The WAL is growing 5x faster than expected. Archiving is…

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

bgwriter_maxwritten_clean Counter Rising

A DBA notices that the buffers_maxwritten_clean counter in pg_stat_bgwriter is steadily rising. This counter increments every time the background writer stops a cleaning scan because it hit bgwriter_lru_maxpages. It mean

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Synchronous Replication and Quorum Commit

How synchronous replication works: synchronous_commit levels, synchronous_standby_names, FIRST vs ANY quorum, and the durability/availability trade-off.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

ANALYZE Internals: How PostgreSQL Samples Your Data to Guess Plans

How PostgreSQL's ANALYZE works at the source level: two-stage reservoir sampling, the 300x rule, MCV lists, equi-depth histograms, n_distinct, correlation, and how it differs from Oracle DBMS_STATS.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Detach and attach partitions with minimal locking

Rolling off old data or loading a prepared partition should not lock the whole table. Use CONCURRENTLY-style attach/detach.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Multixact wraparound approaching

age(datminmxid) in pg_database is growing — multixact IDs are approaching their 32-bit wraparound limit. This is a separate counter from transaction ID wr

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Read Scaling: Read Replicas vs Shared-Storage Replicas vs Read Pools

How to add read capacity in managed PostgreSQL: asynchronous streaming read replicas (RDS, Cloud SQL, Flexible Server) versus shared-storage replicas and read pools (Aurora, AlloyDB), and the consistency trade-offs of…

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

What You Give Up: No Superuser, Extension Allowlists, Locked GUCs, and Lock-In

The real constraints of managed PostgreSQL: losing superuser, extension allowlists, parameters you cannot change, the absence of OS access, and how cloud-native forks deepen lock-in — plus how to plan…

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

PERMISSION DENIED — Diagnosing SQLSTATE 42501 insufficient_privilege

Step-by-step playbook to diagnose SQLSTATE 42501 — permission denied for table, schema, function, or sequence. Covers the full PostgreSQL privilege layer stack and default privileges.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Partition a growing table with declarative partitioning

A single table is too big to vacuum, index, or prune efficiently. Range partitioning by time makes maintenance and queries fast.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

The Lock Fast-Path: How PostgreSQL Avoids a Traffic Jam at the Lock Table

How PostgreSQL's fast-path locking records weak relation locks in per-backend slots: LOCALLOCK, the 16 fast-path slots, strong-lock transfers, and reading pg_locks.fastpath.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

pg_rewind: Reusing a Diverged Old Primary

How pg_rewind works: timeline divergence after failover, finding changed blocks since the divergence point, and rejoining an old primary as a standby cheaply.

Triage

Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.

Contain

Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).

Validate

Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.

Prevent

Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.

Open the full cookbook scenario

Applied practice

Incident drills

Real production symptoms pulled from the error library. Read the symptom, decide your first move, then reveal what it is and why — then grade yourself honestly.

Self-grade each call

The server reports SQLSTATE 2201X (invalid_row_count_in_result_offset_clause), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2201X.

2201X · invalid_row_count_in_result_offset_clause

Root cause

2201X belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

The role lacks a required privilege on the object, schema, or database. The error is written to the server log and returned to the client carrying SQLSTATE 42501.

42501 · insufficient_privilege

Root cause

Access control rejected the operation at one of the privilege layers: database CONNECT, schema USAGE, or the object privilege itself. Missing schema USAGE is by far the most common cause. Common causes: Object privilege granted but schema USAGE forgotten. Past GRANT ... ON ALL TABLES does not cover future tables. PostgreSQL 15+ no longer auto-grants…

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 2200C (invalid_use_of_escape_character), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200C.

2200C · invalid_use_of_escape_character

Root cause

2200C belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 42602 (invalid_name), a condition in the Syntax Error or Access Rule Violation class. The error is written to the server log and returned to the client carrying SQLSTATE 42602.

42602 · invalid_name

Root cause

42602 belongs to Class 42 — Syntax Error or Access Rule Violation. In this class, the SQL is malformed or references a missing object or privilege. The first two characters (42) identify the error class, so application code can match the whole class via 42000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 22016 (invalid_argument_for_nth_value_function), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 22016.

22016 · invalid_argument_for_nth_value_function

Root cause

22016 belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

Two or more transactions blocked each other and PostgreSQL aborted one to break the deadlock. The error is written to the server log and returned to the client carrying SQLSTATE 40P01.

40P01 · deadlock_detected

Root cause

The deadlock detector found a cycle in lock waits and canceled the cheapest victim transaction. Common causes: Transactions acquiring the same rows or tables in different orders. Missing indexes on foreign keys causing wider locks. Long transactions holding locks while waiting.

Open the full recovery & prevention Build the underlying pathway skill

Conditions: ERROR: connection to server was lost SSL connection has been closed unexpectedly Disconnects happen during idle periods (30 seconds to several minutes), not under load pgBouncer logs show closing because: server idle timeout or server login failed pg_stat_activity shows…

08006 · connection_failure

Root cause

08006 fires when the TCP connection to a PostgreSQL backend is severed mid-session — the connection existed and was valid, then died without a clean close. Unlike 08001 (could not establish connection), 08006 means you had a connection and lost it. Root causes by layer: pgBouncer server_idle_timeout — pgBouncer drops idle server-side connections after a…

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 2200D (invalid_escape_octet), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200D.

2200D · invalid_escape_octet

Root cause

2200D belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

The error appears verbatim in client logs and pg_log: FATAL: sorry, too many clients already Alternatively, when max_connections is set per database or user: FATAL: too many connections for role "app_user" FATAL: too many connections for database "production_db" Observable symptoms…

53300 · too_many_connections

Root cause

PostgreSQL maintains a fixed-size connection slot array allocated at startup. When all slots are filled, there is no backpressure mechanism — new connection attempts are immediately refused with FATAL: sorry, too many clients already. Common causes No connection pooler: The application opens one connection per thread/worker/request directly to PostgreSQL. Under load, this exhausts slots fast.…

Open the full recovery & prevention Build the underlying pathway skill

A relation with that name already exists. The error is written to the server log and returned to the client carrying SQLSTATE 42P07.

42P07 · duplicate_table

Root cause

CREATE TABLE/INDEX/etc. targets a name already in use.

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 2200N (invalid_xml_content), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200N.

2200N · invalid_xml_content

Root cause

2200N belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill

The server reports SQLSTATE 22025 (invalid_escape_sequence), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 22025.

22025 · invalid_escape_sequence

Root cause

22025 belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.

Open the full recovery & prevention Build the underlying pathway skill