Scenario
A financial transaction processing service runs all operations in SERIALIZABLE isolation to guarantee correctness. During peak hours, application logs show a 30% failure rate with SQLSTATE 40001 (serialization_failure). Retry logic exists but the retries themselves also fail, creating a thundering herd: all retried transactions restart simultaneously, conflict with each other again, and retry again — multiplying the load.
How to Identify
Conditions:
- Application logs contain
ERROR: could not serialize access due to concurrent update or ERROR: could not serialize access due to read/write dependencies among transactions
SQLSTATE: 40001 errors in application error tracking
pg_stat_database.conflicts counter rising (note: this tracks replica conflicts, but application-side retries indicate serialization pressure)
- Error rate correlates directly with concurrent transaction count during peak hours
- Retry storms visible in application metrics — retry spikes coincide with original failure spikes
Analysis Steps
-- 1. Check serialization failure rate in the database
SELECT
datname,
xact_commit,
xact_rollback,
(xact_rollback::float / NULLIF(xact_commit + xact_rollback, 0) * 100)
AS rollback_rate_pct
FROM pg_stat_database
WHERE datname = current_database();
-- 2. Check current isolation level for the session
SHOW transaction_isolation;
SHOW default_transaction_isolation;
-- 3. Look for transactions currently running in SERIALIZABLE mode
SELECT
pid,
usename,
state,
left(query, 80) AS query,
now() - xact_start AS xact_duration
FROM pg_stat_activity
WHERE state = 'active'
AND now() - xact_start > interval '100ms'
ORDER BY xact_duration DESC;
-- 4. Examine pg_stat_database for conflict indicators
SELECT
datname,
conflicts,
deadlocks,
xact_rollback
FROM pg_stat_database
WHERE datname = current_database();
-- 5. Check if serializable_failure_on_conflict is a known pattern
-- Look at the rate of rollbacks as a proxy for serialization failures:
SELECT datname,
xact_rollback,
deadlocks,
blk_read_time,
blk_write_time
FROM pg_stat_database
WHERE datname = current_database();
Pitfalls
- Serialization failures are normal and expected in
SERIALIZABLE isolation — they are a correctness mechanism, not a database bug. The application MUST always retry transactions that fail with SQLSTATE 40001.
- Do not use
REPEATABLE READ assuming it prevents all anomalies — it prevents non-repeatable reads but allows write skew anomalies that SERIALIZABLE catches and rejects.
- Not all
40001 errors need DBA investigation — a small, steady failure rate with successful retries is acceptable. Only high rates or thundering herds need architectural intervention.
- Naive retry logic without exponential backoff causes thundering herd — all retrying transactions restart simultaneously and conflict again. Exponential backoff with jitter breaks the synchronization.
- Minimizing transaction duration is the most effective mitigation — shorter transactions have shorter conflict windows and lower probability of overlapping with concurrent transactions.
Resolution Approach
Serialization failures require application-layer handling: implement exponential backoff with random jitter so retries do not thunder together. Minimize transaction duration by moving non-database work outside the transaction boundary. For operations where exact serializable isolation is not strictly required, consider READ COMMITTED with explicit SELECT FOR UPDATE locking to avoid the retry cost while maintaining correctness for specific rows.