The one thing to understand first
PostgreSQL never lets a reader block a writer or a writer block a reader for ordinary SELECT/INSERT/UPDATE/DELETE traffic. It achieves this with Multi-Version Concurrency Control (MVCC): instead of overwriting a row in place, every UPDATE writes a new physical copy, called a tuple version, and the system decides at read time which version each transaction is allowed to see.
An UPDATE in PostgreSQL is not an edit; it is an insert plus a tombstone. Once that clicks, bloat, vacuum, long-transaction hazards, and isolation levels all stop being surprising. The whole mechanism lives in the on-disk tuple header — understand that struct and you understand MVCC.
The heap tuple header
Every row stored in a heap page is prefixed by a HeapTupleHeaderData struct, defined in src/include/access/htup_details.h. The fields that drive visibility are:
- t_xmin: the transaction ID (XID) that inserted this version.
- t_xmax: the XID that deleted or updated this version (0 if still live).
- t_cid: command ID, distinguishing changes within the same transaction.
- t_ctid: a (block, offset) pointer to the next version of this row (used for UPDATE chains).
- t_infomask / t_infomask2: bit flags caching commit status and other metadata.
You can see these directly with the pageinspect extension:
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, t_xmin, t_xmax, t_ctid, t_infomask::bit(16)
FROM heap_page_items(get_raw_page('accounts', 0));
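The t_infomask value returned by heap_page_items is a plain integer, so it can be decoded offline. The following is a minimal Python sketch, not part of PostgreSQL: the helper name decode_infomask is hypothetical, and the bit values are the ones I understand htup_details.h to define, so check them against the header for your server version.

```python
# Bit values as I understand them from src/include/access/htup_details.h;
# confirm against the header for your server version.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID = 0x0200
HEAP_XMIN_FROZEN = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID
HEAP_XMAX_COMMITTED = 0x0400
HEAP_XMAX_INVALID = 0x0800
HEAP_UPDATED = 0x2000

OTHER_FLAGS = [
    ("HEAP_XMAX_COMMITTED", HEAP_XMAX_COMMITTED),
    ("HEAP_XMAX_INVALID", HEAP_XMAX_INVALID),
    ("HEAP_UPDATED", HEAP_UPDATED),
]


def decode_infomask(infomask):
    """Name the visibility-related bits set in a t_infomask integer."""
    names = []
    xmin_bits = infomask & HEAP_XMIN_FROZEN
    if xmin_bits == HEAP_XMIN_FROZEN:
        names.append("HEAP_XMIN_FROZEN")  # both xmin bits set means frozen
    elif xmin_bits == HEAP_XMIN_COMMITTED:
        names.append("HEAP_XMIN_COMMITTED")
    elif xmin_bits == HEAP_XMIN_INVALID:
        names.append("HEAP_XMIN_INVALID")
    names.extend(name for name, bit in OTHER_FLAGS if infomask & bit)
    return names


if __name__ == "__main__":
    # 0x0900: inserter known committed, no valid deleter (a live row).
    print(decode_infomask(0x0900))
    # 0x2500: a version created by an UPDATE and later superseded by a
    # committed transaction.
    print(decode_infomask(0x2500))
```

Only the visibility-related bits are decoded here; the remaining bits (null bitmap, lock modes, multixact) are deliberately left out to keep the sketch short.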
What an UPDATE actually does
An UPDATE is not an in-place edit. In heap_update() (src/backend/access/heap/heapam.c) PostgreSQL:
- Stamps the old tuple’s t_xmax with the current transaction’s XID, marking it deleted as of this transaction.
- Writes a new tuple with t_xmin = current XID.
- Sets the old tuple’s t_ctid to point at the new tuple, forming an update chain.
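The three steps above can be modelled in a few lines of Python. This is a toy sketch, not the logic of heap_update(): the names HeapTuple, toy_update, and latest_version are hypothetical, the "page" is a plain list standing in for block 0, and locking, WAL, free-space checks, and indexes are all ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HeapTuple:
    xmin: int              # XID that created this version
    xmax: int              # XID that superseded it, 0 if none
    ctid: Tuple[int, int]  # (block, offset); points at itself when newest
    data: str


def toy_update(page: List[HeapTuple], offset: int, new_data: str, xid: int) -> int:
    """Apply the three UPDATE steps on a single toy page (offsets are 1-based)."""
    old = page[offset - 1]
    old.xmax = xid                                    # 1. stamp the old version
    new_offset = len(page) + 1
    page.append(HeapTuple(xid, 0, (0, new_offset), new_data))  # 2. write new version
    old.ctid = (0, new_offset)                        # 3. link old -> new
    return new_offset


def latest_version(page: List[HeapTuple], offset: int) -> int:
    """Follow t_ctid links until a tuple points at itself."""
    while page[offset - 1].ctid != (0, offset):
        offset = page[offset - 1].ctid[1]
    return offset


if __name__ == "__main__":
    page = [HeapTuple(xmin=100, xmax=0, ctid=(0, 1), data="a")]
    toy_update(page, 1, "b", xid=101)
    toy_update(page, 2, "c", xid=102)
    for tup in page:
        print(tup)
    print("newest version at offset", latest_version(page, 1))
```

Note that nothing is ever removed from the list: after two updates the page holds three versions, two of them dead, which is the accumulation that VACUUM exists to clean up.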
This is why a heavily updated table accumulates dead tuples and why VACUUM is mandatory: the old versions remain physically present until no transaction can still see them.
How visibility is decided
When a scan reads a tuple, it calls a visibility routine — for the common case, HeapTupleSatisfiesMVCC() in src/backend/access/heap/heapam_visibility.c. Simplified, a tuple is visible to a snapshot if:
- its t_xmin committed before the snapshot was taken, and
- its t_xmax is either zero, or belongs to a transaction that had not committed when the snapshot was taken.
A snapshot (SnapshotData in src/include/utils/snapshot.h) records xmin (the oldest XID still running), xmax (the first XID not yet assigned when the snapshot was taken), and the list of in-progress XIDs. Combined with each transaction's commit status, that triple is enough to classify every tuple as visible or invisible without taking any row locks.
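The simplified rule can be written out as a small Python model. It is a sketch under stated assumptions, not a port of HeapTupleSatisfiesMVCC(): the names Snapshot, xid_visible_in_snapshot, and tuple_visible are hypothetical, the committed set stands in for the commit log, XIDs are compared as plain integers (no wraparound), and a transaction's own changes, row locks, and multixacts are ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Snapshot:
    xmin: int             # oldest XID still running at snapshot time
    xmax: int             # first XID not yet assigned at snapshot time
    xip: FrozenSet[int]   # XIDs in progress at snapshot time


def xid_visible_in_snapshot(xid: int, snap: Snapshot, committed: Set[int]) -> bool:
    """True if xid had committed from this snapshot's point of view."""
    if xid >= snap.xmax:
        return False       # started after the snapshot was taken
    if xid in snap.xip:
        return False       # still running when the snapshot was taken
    return xid in committed  # excludes aborted transactions


def tuple_visible(t_xmin: int, t_xmax: int, snap: Snapshot, committed: Set[int]) -> bool:
    """Apply the two-part rule: inserter visible, deleter absent or not visible."""
    if not xid_visible_in_snapshot(t_xmin, snap, committed):
        return False
    if t_xmax == 0:
        return True
    return not xid_visible_in_snapshot(t_xmax, snap, committed)


if __name__ == "__main__":
    snap = Snapshot(xmin=102, xmax=105, xip=frozenset({102}))
    committed = {99, 101, 102, 103}   # 104 aborted; 102 committed after the snapshot
    for xmin, xmax in [(99, 0), (99, 101), (99, 102), (99, 104), (102, 0), (105, 0)]:
        print((xmin, xmax), tuple_visible(xmin, xmax, snap, committed))
```

The case (99, 102) is the interesting one: XID 102 has since committed its delete, but because it was still in progress when the snapshot was taken, this snapshot keeps seeing the old version.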
The hint-bit optimisation
Checking commit status via the commit log (pg_xact, formerly pg_clog) on every read would be expensive. PostgreSQL caches the answer in t_infomask using bits such as HEAP_XMIN_COMMITTED and HEAP_XMAX_COMMITTED. The first reader that resolves the status sets the hint bit; later readers skip the commit-log lookup. This is why a plain SELECT can dirty pages and generate I/O — it is writing hint bits back.
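The caching effect is easy to see in a toy Python model. This is a sketch, not PostgreSQL's code: CommitLog, ToyTuple, and xmin_committed are hypothetical names, the commit log is a set with a lookup counter, and the real logic's in-progress checks and the matching "invalid" bits for aborted transactions are omitted.

```python
HEAP_XMIN_COMMITTED = 0x0100  # bit value as I understand it from htup_details.h


class CommitLog:
    """Stand-in for pg_xact that counts how often it is consulted."""

    def __init__(self, committed):
        self.committed = set(committed)
        self.lookups = 0

    def did_commit(self, xid):
        self.lookups += 1
        return xid in self.committed


class ToyTuple:
    def __init__(self, xmin):
        self.xmin = xmin
        self.infomask = 0   # no hint bits set yet


def xmin_committed(tup, clog):
    """Check the inserter's status, consulting the commit log only if needed."""
    if tup.infomask & HEAP_XMIN_COMMITTED:
        return True                              # hint bit answers the question
    if clog.did_commit(tup.xmin):
        tup.infomask |= HEAP_XMIN_COMMITTED      # the read modifies the tuple
        return True
    return False


if __name__ == "__main__":
    clog = CommitLog(committed={100})
    tup = ToyTuple(xmin=100)
    xmin_committed(tup, clog)   # first reader pays for the lookup
    xmin_committed(tup, clog)   # second reader uses the hint bit
    print("commit-log lookups:", clog.lookups)
    print("infomask: 0x%04x" % tup.infomask)
```

The assignment to tup.infomask inside a read-only check is the point of the model: it is the in-memory analogue of a SELECT dirtying a page.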
Layer 3 — Watch it happen on your own database
-- Watch xmin/xmax change across an UPDATE
CREATE TABLE demo (id int primary key, v text);
INSERT INTO demo VALUES (1, 'a');
SELECT xmin, xmax, ctid FROM demo; -- xmax = 0 (live)
UPDATE demo SET v = 'b' WHERE id = 1;
SELECT xmin, xmax, ctid FROM demo; -- new ctid, new xmin
-- The old version still exists physically until VACUUM runs.
The ctid changes and xmin jumps to your new transaction’s XID — proof that the row was not edited but re-inserted. The old version is still on the page, invisible to you, waiting for vacuum.
Layer 4 — The levers this hands you
- Bloat is structural, not a bug. Dead versions are the price of lock-free reads; autovacuum reclaims them. Tune it per table rather than fighting MVCC.
- Long-running transactions are dangerous. They hold back the snapshot xmin, preventing vacuum from removing any dead tuple whose t_xmax is not older than that XID, in every table of the database and not just the ones the transaction touched. Hunt them with pg_stat_activity.backend_xmin.
- HOT updates (Heap-Only Tuples) avoid touching indexes when no indexed column changes and the new version fits on the same page, keeping the update chain inside one page. Lower fillfactor on update-heavy tables to make HOT more likely. See the HEAP_HOT_UPDATED flag.
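To make the long-transaction hazard concrete, here is a toy Python model of the removal horizon. It is a sketch under simplifying assumptions: removable_horizon and removable are hypothetical names, the backend_xmin values are passed in as a list rather than read from pg_stat_activity, every dead tuple's deleting transaction is assumed committed, and replication slots, prepared transactions, and XID wraparound are ignored.

```python
from typing import Iterable, List, Optional, Tuple


def removable_horizon(backend_xmins: Iterable[Optional[int]], next_xid: int) -> int:
    """Oldest XID that some open snapshot may still need to see around."""
    active = [x for x in backend_xmins if x is not None]
    return min(active) if active else next_xid


def removable(dead_tuples: List[Tuple[str, int]], horizon: int) -> List[str]:
    """Return labels of dead tuples whose t_xmax is older than the horizon."""
    return [label for label, xmax in dead_tuples if xmax < horizon]


if __name__ == "__main__":
    dead = [("v1", 120), ("v2", 480), ("v3", 900)]   # (label, t_xmax)

    # Only short transactions are open: almost everything can be reclaimed.
    healthy = removable_horizon([None, 990], next_xid=1000)
    print(healthy, removable(dead, healthy))

    # One session has held a snapshot since XID 150.
    stuck = removable_horizon([150, 990], next_xid=1000)
    print(stuck, removable(dead, stuck))
```

A single old backend_xmin drags the minimum down for everyone, which is why one forgotten session can make vacuum appear to do nothing.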
Layer 5 — What an Oracle DBA should expect vs what they get
If you come from Oracle, MVCC will feel familiar but the bookkeeping lives somewhere completely different:
- No UNDO/rollback segments. Oracle keeps the old image in UNDO and overwrites the row in place; readers reconstruct the past from UNDO. PostgreSQL keeps the old image in the table itself as a dead tuple. That is why Postgres has no “ORA-01555 snapshot too old” from UNDO pressure — but does have table bloat and a hard dependency on vacuum.
- “Where did my UNDO tablespace go?” It does not exist. The cost of versioning shows up as heap and index bloat instead, and your tuning energy moves from UNDO retention to autovacuum aggressiveness.
- Rollback is instant, cleanup is deferred. Oracle pays at rollback; PostgreSQL simply marks the transaction aborted and lets vacuum reclaim the orphaned tuples later. A huge transaction that aborts is cheap to roll back but leaves a vacuum bill.
Key takeaway
MVCC means “append new versions, decide visibility at read time.” Every UPDATE writes a new tuple and tombstones the old one via t_xmax; every scan applies the snapshot rules to pick which versions you may see. Once you hold that model, vacuum behaviour, bloat, transaction-ID wraparound, and isolation levels all follow logically, and the cause of most “mystery bloat” turns out to be a long transaction, not a Postgres bug.