Train for the room and the pager. Real questions, the reasoning behind a strong answer, and live incident drills — not surface trivia.
Real interview and on-call questions — framed the way they're actually asked. Each one carries the model answer as a reasoning checklist, the signal the interviewer is testing, and the pitfall that ends interviews. Test yourself, then reveal.
Design a zero-downtime way to add a NOT NULL column with a default to a 1 TB table.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Transaction-ID wraparound is approaching. Explain the mechanism, the risk, and your remediation.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Replica lag is growing under write load. Diagnose the causes and the levers you'd pull.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Partition a 5 TB time-series table, or just buy a bigger box? How do you decide?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
How does the planner estimate row counts, and what makes estimates go wrong?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
A B-tree index is bloated and reads are slow. REINDEX vs pg_repack — and the online options?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
When do you need PgBouncer, and what are the transaction- vs session-pooling tradeoffs?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Design the HA and failover architecture for a payments database with an RPO of zero.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Define your backup, PITR and disaster-recovery strategy — and how you'd test it.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Roll out a major-version upgrade across a fleet with minimal risk and downtime.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Set capacity and autovacuum policy for a multi-tenant SaaS with noisy neighbours.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
The primary is up but p99 latency jumped 10x with no deploy. Walk me through your response.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
You are paged for "What “Managed PostgreSQL” Actually Means: The Shared Responsibility Model". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "Two Architectures: “Postgres-on-a-Disk” vs Cloud-Native Disaggregated Storage". What is your first 10-minute plan, then your 1-hour stabilization plan?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
An incident starts from "Google AlloyDB: Disaggregated Storage and a Columnar HTAP Engine". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "Azure Database for PostgreSQL Flexible Server: Community Postgres on Azure", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "Neon: Serverless PostgreSQL, Scale-to-Zero, and Copy-on-Write Branching". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "Databricks Lakebase: Lakehouse-Native PostgreSQL OLTP". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "High Availability & Failover Models Compared", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
You inherit "Read Scaling: Read Replicas vs Shared-Storage Replicas vs Read Pools" during on-call. Explain the safest path from symptom to durable fix.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "Backups, PITR, and the Durability You Actually Get". Walk me through triage, containment, root cause proof, and prevention.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
An incident starts from "Connections and Pooling: Why a Pooler Is Almost Mandatory". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "What You Give Up: No Superuser, Extension Allowlists, Locked GUCs, and Lock-In", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "Cross-Region Disaster Recovery: RPO, RTO, and Async Replication Over the WAN". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "Failover Slots: Keeping Logical Replication Alive Across a Failover (PostgreSQL 17)". What is your first 10-minute plan, then your 1-hour stabilization plan?
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
For "Cascading Replication and Delayed Standbys: Shaping the Replica Topology", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You inherit "pgBackRest: How Incremental Backups and Point-in-Time Recovery Actually Work" during on-call. Explain the safest path from symptom to durable fix.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "Cluster Managers: How Patroni and etcd Decide Who Is Primary". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "The Background Processes: Who Actually Runs a PostgreSQL Server". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "Memory Contexts: Why PostgreSQL Almost Never Calls free()", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "How a B-tree Splits Without Stopping the World: Lehman-Yao in PostgreSQL". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "One Tuple at a Time: How the Volcano Executor Runs Your Plan". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "Anatomy of a WAL Record: What PostgreSQL Writes Before It Touches Your Data", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "The Lock Fast-Path: How PostgreSQL Avoids a Traffic Jam at the Lock Table". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "Paths vs Plans: How the Optimizer Chooses, Then Commits". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
For "From SQL Text to Result: The Parse, Plan, Execute Pipeline", outline your evidence-first response: diagnose, mitigate, verify, and prevent recurrence.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "ANALYZE Internals: How PostgreSQL Samples Your Data to Guess Plans". Walk me through triage, containment, root cause proof, and prevention.
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
An incident starts from "PERMISSION DENIED — Diagnosing SQLSTATE 42501 insufficient_privilege". What is your first 10-minute plan, then your 1-hour stabilization plan?
checklistWhat a strong answer covers
State the first 5-minute triage checks (blast radius, blocking queries, replication/IO/locks) and what decides escalation.
Define containment first (reduce impact), then diagnosis (prove root cause) using concrete SQL/metrics.
Choose the least-risk fix path and explain rollback if the fix does not hold.
Close with prevention: alert, guardrail, and runbook update so this incident class gets faster next time.
sensorsWhat they're really testing
Can you run a safe incident loop under pressure and justify each decision with evidence?
warningCommon pitfall
Jumping to a permanent fix before stabilizing user impact and capturing proof.
You are paged for "Automatic Failover, Fencing, and Split-Brain". Walk me through triage, containment, root cause proof, and prevention.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
You inherit "Synchronous Replication and Quorum Commit" during on-call. Explain the safest path from symptom to durable fix.
lockPro — Senior & Staff model answers
The question is yours to practise free. Pro unlocks the reasoning checklist, the signal interviewers score, and the pitfall to avoid — for every Senior and Staff question.
Practise structured response on cookbook scenarios: triage, contain, validate, prevent. Grade each run and reshuffle for repetition.
smart_toyGrade your scenario runs
flareFull Page Writes Inflating WAL Size
A DBA notices WAL generation is 800 MB/min despite an average write rate of only 150 MB/min of actual data. The WAL is growing 5x faster than expected. Archiving is…
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
A DBA notices that the buffers_maxwritten_clean counter in pg_stat_bgwriter is steadily rising. This counter increments every time the background writer stops a cleaning scan because it hit bgwriter_lru_maxpages. It mean
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
How synchronous replication works: synchronous_commit levels, synchronous_standby_names, FIRST vs ANY quorum, and the durability/availability trade-off.
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
flareANALYZE Internals: How PostgreSQL Samples Your Data to Guess Plans
How PostgreSQL's ANALYZE works at the source level: two-stage reservoir sampling, the 300x rule, MCV lists, equi-depth histograms, n_distinct, correlation, and how it differs from Oracle DBMS_STATS.
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
age(datminmxid) in pg_database is growing — multixact IDs are approaching their 32-bit wraparound limit. This is a separate counter from transaction ID wr
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
flareRead Scaling: Read Replicas vs Shared-Storage Replicas vs Read Pools
How to add read capacity in managed PostgreSQL: asynchronous streaming read replicas (RDS, Cloud SQL, Flexible Server) versus shared-storage replicas and read pools (Aurora, AlloyDB), and the consistency trade-offs of…
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
flareWhat You Give Up: No Superuser, Extension Allowlists, Locked GUCs, and Lock-In
The real constraints of managed PostgreSQL: losing superuser, extension allowlists, parameters you cannot change, the absence of OS access, and how cloud-native forks deepen lock-in — plus how to plan…
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
Step-by-step playbook to diagnose SQLSTATE 42501 — permission denied for table, schema, function, or sequence. Covers the full PostgreSQL privilege layer stack and default privileges.
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
flareThe Lock Fast-Path: How PostgreSQL Avoids a Traffic Jam at the Lock Table
How PostgreSQL's fast-path locking records weak relation locks in per-backend slots: LOCALLOCK, the 16 fast-path slots, strong-lock transfers, and reading pg_locks.fastpath.
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
How pg_rewind works: timeline divergence after failover, finding changed blocks since the divergence point, and rejoining an old primary as a standby cheaply.
searchTriage
Confirm blast radius and active impact first (error rate, latency, failed writes, blocked sessions) before touching production state.
health_and_safetyContain
Apply the lowest-risk containment (cancel runaway query, throttle workload, isolate noisy path, fail over only if needed).
analyticsValidate
Prove recovery with objective signals (p95/p99, queue depth, lock waits, replication lag) and keep rollback criteria explicit.
shieldPrevent
Create one durable prevention artifact: runbook step, alert threshold, migration guardrail, or capacity policy update.
Real production symptoms pulled from the error library. Read the symptom, decide your first move, then reveal what it is and why — then grade yourself honestly.
stethoscopeSelf-grade each call
e911_emergencyThe server reports SQLSTATE 2201X (invalid_row_count_in_result_offset_clause), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2201X.
2201X · invalid_row_count_in_result_offset_clause
bug_reportRoot cause
2201X belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.
e911_emergencyThe role lacks a required privilege on the object, schema, or database. The error is written to the server log and returned to the client carrying SQLSTATE 42501.
42501 · insufficient_privilege
bug_reportRoot cause
Access control rejected the operation at one of the privilege layers: database CONNECT, schema USAGE, or the object privilege itself. Missing schema USAGE is by far the most common cause. Common causes: Object privilege granted but schema USAGE forgotten. Past GRANT ... ON ALL TABLES does not cover future tables. PostgreSQL 15+ no longer auto-grants…
e911_emergencyThe server reports SQLSTATE 2200C (invalid_use_of_escape_character), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200C.
2200C · invalid_use_of_escape_character
bug_reportRoot cause
2200C belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.
e911_emergencyThe server reports SQLSTATE 42602 (invalid_name), a condition in the Syntax Error or Access Rule Violation class. The error is written to the server log and returned to the client carrying SQLSTATE 42602.
42602 · invalid_name
bug_reportRoot cause
42602 belongs to Class 42 — Syntax Error or Access Rule Violation. In this class, the SQL is malformed or references a missing object or privilege. The first two characters (42) identify the error class, so application code can match the whole class via 42000 when the specific code is not needed.
e911_emergencyThe server reports SQLSTATE 22016 (invalid_argument_for_nth_value_function), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 22016.
22016 · invalid_argument_for_nth_value_function
bug_reportRoot cause
22016 belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.
e911_emergencyTwo or more transactions blocked each other and PostgreSQL aborted one to break the deadlock. The error is written to the server log and returned to the client carrying SQLSTATE 40P01.
40P01 · deadlock_detected
bug_reportRoot cause
The deadlock detector found a cycle in lock waits and canceled the cheapest victim transaction. Common causes: Transactions acquiring the same rows or tables in different orders. Missing indexes on foreign keys causing wider locks. Long transactions holding locks while waiting.
e911_emergencyConditions: ERROR: connection to server was lost SSL connection has been closed unexpectedly Disconnects happen during idle periods (30 seconds to several minutes), not under load pgBouncer logs show closing because: server idle timeout or server login failed pg_stat_activity shows…
08006 · connection_failure
bug_reportRoot cause
08006 fires when the TCP connection to a PostgreSQL backend is severed mid-session — the connection existed and was valid, then died without a clean close. Unlike 08001 (could not establish connection), 08006 means you had a connection and lost it. Root causes by layer: pgBouncer server_idle_timeout — pgBouncer drops idle server-side connections after a…
e911_emergencyThe server reports SQLSTATE 2200D (invalid_escape_octet), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200D.
2200D · invalid_escape_octet
bug_reportRoot cause
2200D belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.
e911_emergencyThe error appears verbatim in client logs and pg_log: FATAL: sorry, too many clients already Alternatively, when max_connections is set per database or user: FATAL: too many connections for role "app_user" FATAL: too many connections for database "production_db" Observable symptoms…
53300 · too_many_connections
bug_reportRoot cause
PostgreSQL maintains a fixed-size connection slot array allocated at startup. When all slots are filled, there is no backpressure mechanism — new connection attempts are immediately refused with FATAL: sorry, too many clients already. Common causes No connection pooler: The application opens one connection per thread/worker/request directly to PostgreSQL. Under load, this exhausts slots fast.…
e911_emergencyThe server reports SQLSTATE 2200N (invalid_xml_content), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 2200N.
2200N · invalid_xml_content
bug_reportRoot cause
2200N belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.
e911_emergencyThe server reports SQLSTATE 22025 (invalid_escape_sequence), a condition in the Data Exception class. The error is written to the server log and returned to the client carrying SQLSTATE 22025.
22025 · invalid_escape_sequence
bug_reportRoot cause
22025 belongs to Class 22 — Data Exception. In this class, a value is invalid for its data type or the operation applied to it. The first two characters (22) identify the error class, so application code can match the whole class via 22000 when the specific code is not needed.