The Buffer Cache: How shared_buffers and Clock-Sweep Work

The one thing to understand first

shared_buffers is a fixed-size array of 8KB page frames allocated in shared memory at startup. Every backend reads and writes data through this cache; a page on disk must be loaded into a buffer before it can be examined or modified. The machinery lives in src/backend/storage/buffer/, principally bufmgr.c and freelist.c.

Nothing in PostgreSQL touches a row on disk directly — it touches a copy in a buffer, and a clock hand decides which copies survive. Grasp that the cache is a fixed array managed by an approximate-LRU sweep, and eviction behaviour, the post-checkpoint write storm, and why huge shared_buffers can backfire all make sense.

Three parallel structures

The cache is described by three parallel structures in shared memory. The first two are arrays with NBuffers entries; the third is a hash table:

Buffer blocks — the actual 8KB page data.
BufferDescriptors — one BufferDesc per frame (src/include/storage/buf_internals.h) holding the page’s identity (BufferTag: tablespace, database, and relfilenode, plus fork and block number), a pin count, a usage count, and state flags packed into an atomic word.
The buffer mapping table — a shared hash table mapping BufferTag → buffer id, so a backend can find whether a page is already cached.
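
To make the relationship between the three structures concrete, here is a minimal Python model. It is a sketch, not PostgreSQL code: the class names BufferPool and install(), and the Python field names, are invented for illustration, and the real descriptor packs its counters and flags into a single atomic word rather than separate fields.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Optional

BLOCK_SIZE = 8192  # default PostgreSQL page size


class BufferTag(NamedTuple):
    """Identity of one page. Field names are illustrative, not the C names."""
    tablespace: int
    database: int
    relfilenode: int
    fork: int
    block: int


@dataclass
class BufferDesc:
    """Toy descriptor: one per frame, same index as the frame it describes."""
    tag: Optional[BufferTag] = None
    pin_count: int = 0
    usage_count: int = 0
    dirty: bool = False


class BufferPool:
    """Hypothetical container tying the three structures together."""

    def __init__(self, nbuffers: int) -> None:
        # 1. Buffer blocks: the page data itself.
        self.blocks: List[bytearray] = [bytearray(BLOCK_SIZE) for _ in range(nbuffers)]
        # 2. Descriptors: metadata for the frame at the same index.
        self.descriptors: List[BufferDesc] = [BufferDesc() for _ in range(nbuffers)]
        # 3. Mapping table: page identity -> buffer id.
        self.mapping: Dict[BufferTag, int] = {}

    def lookup(self, tag: BufferTag) -> Optional[int]:
        """Return the buffer id caching this page, or None on a miss."""
        return self.mapping.get(tag)

    def install(self, buf_id: int, tag: BufferTag) -> None:
        """Point frame buf_id at a new page, dropping its old mapping."""
        old = self.descriptors[buf_id].tag
        if old is not None:
            del self.mapping[old]
        self.descriptors[buf_id].tag = tag
        self.mapping[tag] = buf_id
```

The point of the layout is that a buffer id is the only handle needed: it indexes both arrays, and the mapping table is the only structure that has to be searched.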

Reading a page

ReadBuffer() → BufferAlloc() computes the BufferTag, hashes it, and looks it up in the mapping table under a partition lock. On a hit, the buffer is pinned and returned. On a miss, a victim frame must be chosen, the old page evicted (written first if dirty), the mapping table updated, and the new page read from disk.
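
The hit and miss paths can be sketched as a toy model. Everything here is an assumption made for illustration: TinyCache and its methods are invented names, a Python dict stands in for storage, there is no partition locking, and the victim policy is a deliberate placeholder (first unpinned frame) rather than the clock sweep.

```python
class TinyCache:
    """Toy read path: look up the tag, pin on a hit, evict and load on a miss."""

    def __init__(self, nbuffers, disk):
        self.disk = disk                  # dict: tag -> page bytes (stand-in for storage)
        self.tags = [None] * nbuffers
        self.pages = [None] * nbuffers
        self.pins = [0] * nbuffers
        self.dirty = [False] * nbuffers
        self.mapping = {}                 # tag -> buffer id
        self.log = []                     # records what happened, for inspection

    def _choose_victim(self):
        # Placeholder policy only: the first unpinned frame.
        for buf, pins in enumerate(self.pins):
            if pins == 0:
                return buf
        raise RuntimeError("no unpinned buffers available")

    def read_buffer(self, tag):
        buf = self.mapping.get(tag)
        if buf is not None:               # hit: pin and return
            self.pins[buf] += 1
            self.log.append(("hit", tag))
            return buf
        buf = self._choose_victim()       # miss: find a frame to reuse
        old = self.tags[buf]
        if old is not None:
            if self.dirty[buf]:           # write the old page back first
                self.disk[old] = self.pages[buf]
                self.log.append(("writeback", old))
            del self.mapping[old]
        self.pages[buf] = self.disk[tag]  # read the new page from "disk"
        self.tags[buf] = tag
        self.dirty[buf] = False
        self.pins[buf] = 1
        self.mapping[tag] = buf
        self.log.append(("miss", tag))
        return buf

    def release(self, buf):
        self.pins[buf] -= 1
```

With a one-frame cache, reading page "a", dirtying it, releasing it, and then reading page "b" produces a miss, a write-back of "a", and a second miss, in that order.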

Pinning vs locking

Two independent concepts protect a buffer:

A pin (reference count) means “this buffer is in use, do not evict it.” Holding a pin does not block other readers.
A content lock (an LWLock, shared or exclusive) protects the page contents during read or modification.

A backend pins a buffer for as long as it holds a tuple from that page, which is why cursors and long scans keep pages resident.
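
The independence of the two mechanisms can be shown with a small model. This is a sketch under stated assumptions: ToyBuffer and its methods are invented names, and the try_lock calls return False where a real LWLock would make the caller wait.

```python
class ToyBuffer:
    """Toy buffer with a pin count and a separate shared/exclusive content lock."""

    def __init__(self):
        self.pin_count = 0
        self.shared_holders = 0
        self.exclusive_held = False

    # Pins: a plain reference count that only guards against eviction.
    def pin(self):
        self.pin_count += 1

    def unpin(self):
        assert self.pin_count > 0
        self.pin_count -= 1

    def evictable(self):
        return self.pin_count == 0

    # Content lock: guards the page bytes, independent of pins.
    def try_lock_shared(self):
        if self.exclusive_held:
            return False
        self.shared_holders += 1
        return True

    def try_lock_exclusive(self):
        if self.exclusive_held or self.shared_holders:
            return False
        self.exclusive_held = True
        return True

    def unlock_shared(self):
        assert self.shared_holders > 0
        self.shared_holders -= 1

    def unlock_exclusive(self):
        assert self.exclusive_held
        self.exclusive_held = False
```

Two backends can pin and share-lock the same buffer at once; a writer must wait for the shared holders to leave, and the buffer stays unevictable until the last pin is dropped even when no lock is held.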

Clock-sweep: choosing a victim

PostgreSQL does not use strict LRU. Instead StrategyGetBuffer() in freelist.c implements a clock-sweep approximation of LRU:

Each buffer has a usage_count (0–5). It is incremented each time a backend pins the buffer, capped at BM_MAX_USAGE_COUNT.
A shared “clock hand” sweeps circularly through the buffers.
At each buffer the hand first checks the pin count and skips the buffer if it is pinned. If the buffer is unpinned and its usage_count is already 0, it is the victim. Otherwise the hand decrements usage_count and moves on.

This makes frequently used pages survive many sweeps (they keep getting re-warmed) while cold pages drain to zero and get evicted — all without maintaining an ordered list, which would be a contention nightmare under concurrency.
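
The sweep is small enough to model directly. The following is a simplified, single-threaded sketch, not the code in freelist.c: ClockSweep, touch() and get_victim() are invented names, and the real implementation advances the hand atomically and works on the packed state word.

```python
MAX_USAGE_COUNT = 5  # mirrors BM_MAX_USAGE_COUNT


class ClockSweep:
    """Toy clock sweep over per-buffer usage counts and pin counts."""

    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.pins = [0] * nbuffers
        self.hand = 0

    def touch(self, buf):
        """Record an access: bump usage_count, saturating at the cap."""
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE_COUNT)

    def get_victim(self):
        """Advance the hand until an unpinned buffer with usage 0 is found."""
        n = len(self.usage)
        # If any buffer is unpinned, MAX_USAGE_COUNT laps drain it to zero
        # and one more lap finds it, so this bound is sufficient.
        for _ in range(n * (MAX_USAGE_COUNT + 1)):
            buf = self.hand
            self.hand = (self.hand + 1) % n
            if self.pins[buf] > 0:
                continue              # pinned: skip without decrementing
            if self.usage[buf] == 0:
                return buf            # cold and unpinned: evict this one
            self.usage[buf] -= 1      # warm: take one life and move on
        raise RuntimeError("no unpinned buffers available")
```

A buffer touched three times survives three passes of the hand before it becomes a candidate, while an untouched neighbour is taken on the first pass. That is the whole recency signal: no list is reordered on access, only a small counter is bumped.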

Ring buffers for big scans

A large sequential scan or VACUUM would otherwise evict the entire working set (“cache thrashing”). To prevent this, those operations use a small buffer access strategy ring (GetAccessStrategy()), reusing a handful of buffers instead of polluting the whole cache. For sequential scans the ring applies only when the table is larger than a quarter of shared_buffers; smaller tables go through the cache normally. That is why scanning a 100GB table does not flush your hot OLTP pages.
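
The effect of the ring can be illustrated with a toy comparison. This is a sketch under several assumptions: RingStrategy and frames_overwritten() are invented names, a simple counter stands in for the shared clock sweep, and the real ring also checks whether a recycled frame is still safe to reuse, which is omitted here. The ring size of 32 frames (256KB) in the usage note is assumed for illustration.

```python
from itertools import count


class RingStrategy:
    """Toy access-strategy ring: a fixed set of slots reused in rotation."""

    def __init__(self, ring_size):
        self.slots = [None] * ring_size
        self.current = 0

    def next_buffer(self, pool_victim):
        # Advance to the next slot; fill it from the shared pool only the
        # first time round, then keep recycling the same frame.
        self.current = (self.current + 1) % len(self.slots)
        if self.slots[self.current] is None:
            self.slots[self.current] = pool_victim()
        return self.slots[self.current]


def frames_overwritten(n_pages, pool_size, ring_size=None):
    """Count distinct pool frames that a scan of n_pages pages overwrites."""
    victims = count()  # stand-in for the shared clock sweep: 0, 1, 2, ...

    def pool_victim():
        return next(victims) % pool_size

    ring = RingStrategy(ring_size) if ring_size else None
    touched = set()
    for _ in range(n_pages):
        touched.add(ring.next_buffer(pool_victim) if ring else pool_victim())
    return len(touched)
```

In this model a 10,000-page scan through a 1,000-frame pool overwrites all 1,000 frames without a ring, but only 32 with a 32-slot ring, however long the scan runs.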

Dirty pages and write-back

When a page is modified, its buffer is marked dirty (BM_DIRTY). Dirty buffers are written back by three actors: the background writer (smooths I/O ahead of demand), the checkpointer (flushes everything at a checkpoint), and as a last resort a backend that needs a clean victim. Every write obeys the WAL rule: flush WAL up to the page’s LSN first.
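
The ordering imposed by the WAL rule can be sketched as follows. All names here (ToyWAL, ToyPage, modify, flush_buffer) are hypothetical, LSNs are plain integers, and a dict stands in for the data files; the only thing the sketch is meant to show is that the WAL flush precedes the page write.

```python
class ToyWAL:
    """Toy write-ahead log tracking how far records are inserted and flushed."""

    def __init__(self):
        self.insert_lsn = 0
        self.flushed_lsn = 0
        self.flush_calls = 0

    def append(self):
        """Add one record and return its LSN."""
        self.insert_lsn += 1
        return self.insert_lsn

    def flush(self, upto):
        """Make the log durable up to the given LSN; no-op if already there."""
        if upto > self.flushed_lsn:
            self.flushed_lsn = upto
            self.flush_calls += 1


class ToyPage:
    def __init__(self):
        self.lsn = 0        # LSN of the last WAL record that touched the page
        self.dirty = False


def modify(page, wal):
    """Log the change first, then stamp the page with the record's LSN."""
    page.lsn = wal.append()
    page.dirty = True


def flush_buffer(page, wal, storage, key):
    """Write a dirty page back, obeying WAL-before-data."""
    if not page.dirty:
        return
    wal.flush(page.lsn)      # 1. WAL must be durable up to the page's LSN
    storage[key] = page.lsn  # 2. only then may the page reach storage
    page.dirty = False
```

Whichever actor performs the write-back, the invariant is the same: at no point does storage hold a page whose LSN is ahead of the flushed WAL.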

Layer 3 — Watch it happen on your own database

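CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- Top 10 relations in the current database by number of cached buffers.
SELECT c.relname, count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached  -- assumes the default 8KB block size
FROM   pg_buffercache b
JOIN   pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
                 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                           WHERE datname = current_database()))
GROUP  BY c.relname ORDER BY buffers DESC LIMIT 10;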

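This shows which relations in the current database are resident right now. Run a sequential scan of a cold table that is smaller than a quarter of shared_buffers and watch its buffer count rise; run it again and the second pass is far faster because the pages are already in the cache. Try the same with a much larger table and the count barely moves, because the scan is confined to its ring; any speed-up on the second pass then comes from the OS page cache, not from shared_buffers. Together the two runs show both mechanisms at work: ordinary reads competing for frames under the clock sweep, and bulk reads being kept out of its way.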

Layer 4 — The levers this hands you

Bigger is not always better. Beyond a point, the OS page cache and double-buffering make very large shared_buffers counter-productive; 25% of RAM is a common starting point.
Measure hit ratio from pg_stat_database as blks_hit / (blks_hit + blks_read), but remember the OS cache catches many “misses,” so a 99% Postgres hit ratio is not the whole truth.
Smooth the checkpoint write storm with checkpoint_completion_target and bgwriter_* settings so dirty-buffer flushes are spread out rather than spiking.
pg_buffercache and pg_prewarm let you inspect and pre-load the cache after a restart.
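
The arithmetic behind the first two levers is simple enough to write down. The helper names below are invented for illustration; blks_hit and blks_read are the pg_stat_database counters, and the 25% fraction is the starting point quoted above, not a recommendation for any particular workload.

```python
def hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from shared_buffers, or None if idle."""
    total = blks_hit + blks_read
    return blks_hit / total if total else None


def nbuffers_for(ram_bytes, fraction=0.25, block_size=8192):
    """Number of page frames if shared_buffers is set to a fraction of RAM."""
    return int(ram_bytes * fraction) // block_size


# Example: 990 hits and 10 reads give a 99% hit ratio.
ratio = hit_ratio(990, 10)

# Example: 25% of 16 GiB is 4 GiB, which is 524288 frames of 8KB.
frames = nbuffers_for(16 * 1024**3)
```

Keep in mind that blks_read counts requests that left shared_buffers, not requests that reached the disk, which is why the ratio understates how much is served from memory overall.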

Layer 5 — What an Oracle DBA should expect vs what they get

This is Oracle’s buffer cache in the SGA, but smaller in ambition and reliant on the OS:

No “SGA owns all of RAM.” Oracle typically sizes the buffer cache to hold most of the working set and bypasses the filesystem cache. PostgreSQL deliberately keeps shared_buffers modest and leans on the OS page cache as a second tier — which is why double-buffering, not raw cache size, is the tuning concern.
Clock sweep instead of touch-count LRU lists. Oracle uses LRU/touch-count chains; PostgreSQL uses an approximate-LRU clock sweep with a usage_count that decrements on each pass. The effect is similar (hot pages stay) but there are no separate hot/cold LRU regions to tune.
No DBWR army. Background writing is done by a single bgwriter plus the checkpointer, not a configurable pool of DBWn processes. You tune write smoothing through checkpoint and bgwriter GUCs rather than DBWR counts.
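No KEEP/RECYCLE pools. There is one buffer pool. The closest substitute for a KEEP pool is to pre-warm a hot table with pg_prewarm, but that only loads the pages; nothing reserves frames for them, and they stay resident only as long as they keep being used.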

Key takeaway

The buffer cache is a fixed array of 8KB frames, and an approximate-LRU clock sweep with per-buffer usage_count decides what stays. Dirty buffers are flushed by the checkpointer and bgwriter (and, as a last resort, by a backend that needs a clean victim), not at commit — WAL already guarantees durability. Size shared_buffers for the working set but leave room for the OS cache, and treat the post-checkpoint write spike as the cost of deferring page writes, not a bug to eliminate.