Scenario
PostgreSQL uses Simple LRU (SLRU) caches for internal state like commit log, subtransaction status, and multixact data. pg_stat_slru (PostgreSQL 13+) exposes hit rates that indicate whether these caches are under-sized. Diagnose it -- Available since PostgreSQL…
Investigation Path
PostgreSQL uses Simple LRU (SLRU) caches for internal state like commit log, subtransaction status, and multixact data. pg_stat_slru (PostgreSQL 13+) exposes hit rates that indicate whether these caches are under-sized.
Diagnose it
-- Available since PostgreSQL 13:
SELECT name,
blks_zeroed,
blks_hit,
blks_read,
blks_written,
blks_exists,
round(
100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0),
1) AS hit_pct,
flushes,
truncates,
stats_reset
FROM pg_stat_slru
ORDER BY blks_read DESC;
SLRU names correspond to internal subsystems: Async (LISTEN/NOTIFY state),
CommitTs (commit timestamp data), MultiXactMember/MultiXactOffset
(multixact tracking), Notify, Serial (serializable transaction tracking),
Subtrans (subtransaction status), Xact (transaction commit/abort status).
Why it happens
SLRU buffers are sized by fixed internal constants (not GUCs in most cases) and are usually
small relative to their access patterns. A low hit_pct on Subtrans
can indicate heavy use of savepoints or PL/pgSQL exception blocks (each creates a subtransaction).
Low MultiXact hit rates correspond to heavy row-level locking patterns. High
flushes can contribute to I/O pressure.
How to fix it
For Subtrans misses: reduce the use of savepoints or PL/pgSQL exception blocks
that create subtransactions. For MultiXact misses: reduce concurrent row-level
locking (e.g., many concurrent SELECT FOR SHARE on the same rows). Both can be
partially addressed by a PostgreSQL build-time or GUC setting in newer releases — check the
documentation for your specific version.
Prevent it next time
Monitor SLRU stats regularly and correlate LWLock wait events related to SLRU
with these counters. A correlation between high blks_read in a specific SLRU
and wait_event values like SLRURead confirms a cache miss bottleneck.
Reset stats with pg_stat_reset_shared('slru') to establish a fresh baseline.
Related & next steps
Career Impact
This scenario builds production judgment and operational confidence under pressure.