Write-Ahead Logging: How WAL Guarantees Durability

The one thing to understand first

A transaction is durable when, after COMMIT returns, its effects survive a crash. Flushing every changed data page to disk at commit would be ruinously slow, because those writes are scattered random I/O. PostgreSQL instead uses Write-Ahead Logging (WAL): before any change reaches a data file, a compact, sequential description of that change is written and flushed to the WAL. This “write-ahead” rule is enforced in the buffer manager and is the foundation of crash recovery.

PostgreSQL makes the log durable, not the data — the data pages catch up later. That single inversion explains why commits are fast, why a crash is recoverable, why replication is just “ship the log,” and why a stuck WAL consumer can fill your disk.

The LSN: a byte address in the log

Every WAL record has a Log Sequence Number (LSN): a 64-bit, monotonically increasing byte offset into the logical WAL stream, represented as XLogRecPtr in src/include/access/xlogdefs.h. Each data page header stores the LSN of the last WAL record that modified it (pd_lsn). The buffer manager will not flush a dirty page to disk until the WAL up to that page’s pd_lsn has been flushed; that single check is the write-ahead guarantee.

SELECT pg_current_wal_lsn();          -- current insert position
SELECT pg_walfile_name(pg_current_wal_lsn());  -- which WAL segment file (16MB by default)
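
Because an LSN is just a byte offset, the arithmetic behind these functions can be reproduced by hand. The following is a minimal Python sketch, not PostgreSQL code: the helper names are made up for illustration, and it assumes the default 16MB segment size and timeline 1. The built-in pg_walfile_name() may treat an LSN that sits exactly on a segment boundary differently, so treat the file-name helper as an approximation.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size, fixed at initdb


def parse_lsn(text: str) -> int:
    """Convert the 'hi/lo' hex notation of an LSN into a 64-bit byte offset."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Inverse of parse_lsn: render a byte offset in 'hi/lo' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two LSNs (what pg_wal_lsn_diff computes)."""
    return parse_lsn(a) - parse_lsn(b)


def segment_file_name(lsn: str, timeline: int = 1) -> str:
    """Approximate the segment file that contains this LSN (hypothetical helper)."""
    segno = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE  # 256 segments per 4GB "log id"
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


print(lsn_diff("1/0", "0/FFFFFFFF"))   # 1
print(segment_file_name("0/3000060"))  # 000000010000000000000003
```

The first print shows that the slash is only notation: "1/0" is one byte past "0/FFFFFFFF".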

Inserting a WAL record

Modifying code (heap, btree, etc.) builds a WAL record and calls XLogInsert() (src/backend/access/transam/xloginsert.c). The record describes the change as a redo action: enough information to re-apply it during recovery. Records are appended into shared WAL buffers; XLogInsert returns the LSN of the end of the record.

The first time a page is modified after a checkpoint, PostgreSQL writes a full-page image (FPI) into WAL. This protects against torn pages (a partial 8KB write during a crash) because recovery can restore the entire page from the log. Full-page writes are why WAL volume spikes right after a checkpoint.
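
The "first change after a checkpoint" test comes down to comparing the page's LSN with the checkpoint's redo point. Here is a toy Python model of that decision. The Page class and log_change function are hypothetical names for illustration, and the real logic in PostgreSQL handles many more cases.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0  # stand-in for pd_lsn in the page header


def log_change(page: Page, redo_ptr: int, record_lsn: int,
               full_page_writes: bool = True) -> str:
    """Decide what kind of WAL record a change produces, then stamp the page.

    A page whose LSN is at or before the redo point has not been logged
    since the last checkpoint started, so its first change carries a
    full-page image. Later changes only need a compact delta.
    """
    needs_fpi = full_page_writes and page.lsn <= redo_ptr
    page.lsn = record_lsn
    return "FPI" if needs_fpi else "delta"


page = Page(lsn=100)
print(log_change(page, redo_ptr=500, record_lsn=600))  # FPI
print(log_change(page, redo_ptr=500, record_lsn=700))  # delta
print(log_change(page, redo_ptr=800, record_lsn=900))  # FPI (new checkpoint)
```

Each new checkpoint moves the redo point past every page's LSN, which is why the FPI cost is paid again for every page touched afterward.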

What happens at COMMIT

On commit, RecordTransactionCommit() (src/backend/access/transam/xact.c) writes a commit WAL record and then calls XLogFlush() to force the WAL up to the commit LSN out to durable storage via fsync (or whichever method wal_sync_method selects). Only after that flush succeeds does the backend report success to the client. The data pages themselves may still be dirty in shared buffers; they will be written later by the background writer, by a backend that needs to evict the buffer, or by the next checkpoint.

The cost of that fsync is why synchronous_commit = off is so much faster: it lets the backend return before the flush, trading a tiny window of potential loss for throughput. It does not risk corruption — only the loss of the most recent commits.
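
To make the ordering concrete, here is a small in-memory Python model of the commit path and the write-ahead check. It is a sketch under heavy simplification: ToyWal, ToyBufferManager, and commit are invented names, "flushing" is just advancing a counter, and nothing here mirrors PostgreSQL's actual data structures.

```python
class ToyWal:
    """In-memory stand-in for the WAL: an insert position and a flushed position."""

    def __init__(self):
        self.insert_lsn = 0   # end of the last record appended to WAL buffers
        self.flushed_lsn = 0  # everything up to here counts as durable

    def insert(self, payload: bytes) -> int:
        """Append a record and return the LSN of its end."""
        self.insert_lsn += len(payload)
        return self.insert_lsn

    def flush(self, upto: int) -> None:
        """Pretend to fsync the log up to `upto`."""
        self.flushed_lsn = max(self.flushed_lsn, min(upto, self.insert_lsn))


class ToyBufferManager:
    def __init__(self, wal: ToyWal):
        self.wal = wal
        self.dirty = {}    # page_id -> page LSN, still only in memory
        self.on_disk = {}  # page_id -> page LSN, written to the data file

    def modify(self, page_id: str, payload: bytes) -> int:
        lsn = self.wal.insert(payload)
        self.dirty[page_id] = lsn
        return lsn

    def write_page(self, page_id: str) -> None:
        page_lsn = self.dirty.pop(page_id)
        self.wal.flush(page_lsn)  # write-ahead rule: log before data
        assert self.wal.flushed_lsn >= page_lsn
        self.on_disk[page_id] = page_lsn


def commit(wal: ToyWal, synchronous: bool = True) -> int:
    commit_lsn = wal.insert(b"COMMIT")
    if synchronous:
        wal.flush(commit_lsn)  # the wait that synchronous_commit = off skips
    return commit_lsn


wal = ToyWal()
buffers = ToyBufferManager(wal)
buffers.modify("p1", b"x" * 10)
commit(wal)
print(wal.flushed_lsn, buffers.on_disk)  # 16 {}
```

After the synchronous commit the log is durable up to the commit record while the data file is still empty, which is the inversion described at the top of this article.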

Checkpoints bound recovery time

A checkpoint (CreateCheckPoint() in src/backend/access/transam/xlog.c) flushes all dirty buffers and records a checkpoint WAL record. After a crash, recovery only needs to replay WAL from the last checkpoint’s redo point forward, because everything before it is known to be on disk. checkpoint_timeout and max_wal_size control how often this happens, trading steady-state I/O against recovery time.

Crash recovery: REDO

On startup after an unclean shutdown, StartupXLOG() reads the control file (pg_control) to find the redo point, then replays each WAL record by dispatching to the rm_redo callback of the record's resource manager (for example, heap_redo). Each record is idempotent against its target page: the redo routine checks the page LSN and skips records already reflected on the page. When replay reaches the end of valid WAL, the database is consistent and opens for connections.
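
The LSN check that makes replay idempotent is easy to model. The Python sketch below is illustrative only: replay is a hypothetical function, each record carries a single LSN and overwrites a single value, and real redo routines are specific to each resource manager.

```python
def replay(pages: dict, wal_records: list, redo_point: int) -> int:
    """Replay records at or after redo_point; return how many were applied.

    pages:       page_id -> {"lsn": int, "value": object}
    wal_records: (lsn, page_id, new_value) tuples in LSN order
    """
    applied = 0
    for lsn, page_id, new_value in wal_records:
        if lsn < redo_point:
            continue  # covered by the checkpoint, known to be on disk
        page = pages.setdefault(page_id, {"lsn": 0, "value": None})
        if page["lsn"] >= lsn:
            continue  # page already reflects this record, skip it
        page["value"] = new_value
        page["lsn"] = lsn
        applied += 1
    return applied


pages = {
    "a": {"lsn": 30, "value": "v3"},   # written to disk before the crash
    "b": {"lsn": 10, "value": "old"},  # later change never reached disk
}
records = [(10, "b", "old"), (20, "a", "v2"), (30, "a", "v3"), (40, "b", "new")]

print(replay(pages, records, redo_point=20))  # 1
print(replay(pages, records, redo_point=20))  # 0
```

Only the change to page "b" is applied, and running replay a second time applies nothing, so a crash during recovery is harmless.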

Layer 3 — Watch it happen on your own database

-- Total WAL written so far, and WAL retained by each replication slot
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal;
SELECT slot_name, restart_lsn, pg_size_pretty(
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM   pg_replication_slots;

Run an UPDATE, then call pg_current_wal_lsn() again — the LSN advances by the size of the WAL you just generated. The second query is the one you reach for in an incident: a slot whose retained bytes keep climbing is the reason your disk is filling.

Layer 4 — The levers this hands you

WAL is also the replication stream. Streaming replication ships these same records to standbys, which run the redo path continuously — so WAL volume is replication bandwidth.
Archiving WAL enables PITR. A base backup plus archived WAL lets you replay to any point in time; archive_command/archive_library is what makes that possible.
WAL bloat has real causes. An unconsumed replication slot or a failing archive_command prevents segment recycling and can fill the disk. max_slot_wal_keep_size caps the damage a slot can do; it does not help with a failing archiver, which you have to catch by monitoring.
synchronous_commit is a throughput dial, not a safety switch. Turning it off skips the commit fsync wait and risks only the last few commits — never corruption.
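
The WAL-bloat point can be pictured as a "minimum of several positions" calculation: WAL older than the oldest position anyone still needs can be recycled. The Python sketch below is a deliberate simplification with a hypothetical function name. It ignores wal_keep_size and segment granularity and works on plain integer LSNs.

```python
def oldest_needed_lsn(redo_lsn, slot_restart_lsns, current_lsn,
                      oldest_unarchived_lsn=None, max_slot_keep_bytes=None):
    """Return the oldest LSN that must stay on disk in this toy model.

    redo_lsn:              redo point of the last checkpoint
    slot_restart_lsns:     restart_lsn of each replication slot
    oldest_unarchived_lsn: start of the oldest segment not yet archived
    max_slot_keep_bytes:   models max_slot_wal_keep_size (None = unlimited)
    """
    candidates = [redo_lsn]
    for restart in slot_restart_lsns:
        if max_slot_keep_bytes is not None:
            # A capped slot cannot hold back more than this many bytes.
            restart = max(restart, current_lsn - max_slot_keep_bytes)
        candidates.append(restart)
    if oldest_unarchived_lsn is not None:
        candidates.append(oldest_unarchived_lsn)  # the slot cap does not apply
    return min(candidates)


current = 5000
# A stuck slot pins WAL all the way back to its restart_lsn.
print(current - oldest_needed_lsn(1000, [200], current))  # 4800
# With a cap of 1000 bytes, the checkpoint's redo point becomes the limit.
print(current - oldest_needed_lsn(1000, [200], current,
                                  max_slot_keep_bytes=1000))  # 4000
```

Passing oldest_unarchived_lsn=50 together with the cap returns 50, which illustrates why the slot cap alone does not protect against a failing archiver.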

Layer 5 — What an Oracle DBA should expect vs what they get

WAL is Oracle’s redo log under a different name, with a few sharp differences:

One log, not redo + UNDO. Oracle separates redo (for roll-forward) from UNDO (for rollback and read consistency). PostgreSQL’s WAL is roll-forward only; rollback and MVCC rely on old row versions that stay in the heap as dead tuples, so there is no UNDO log to size or tune.
Full-page images instead of a fractured-block fix-up. The first change to a page after each checkpoint writes the whole page into WAL to defeat torn writes — which is why Postgres WAL volume spikes right after a checkpoint. Oracle’s equivalent concern (fractured blocks) is handled differently; the post-checkpoint WAL surge surprises Oracle DBAs.
LSN is your SCN. The XLogRecPtr LSN plays the role of Oracle’s SCN as the global progress marker, but it is a literal byte offset into the log — you can subtract two LSNs to get bytes, which is how you measure replication lag and WAL rate directly.
Archiving is something you configure, not something that runs by default. PostgreSQL starts its archiver process only when archive_mode is on, and that process simply runs the archive_command or archive_library you supply; getting archive_command right (and monitored) is the Postgres equivalent of babysitting the archive destination.

Key takeaway

Durability in PostgreSQL is a property of the log, not the data files. XLogInsert() appends a redo description, XLogFlush() forces it to disk at commit, and the dirty data pages are written lazily afterward. Crash recovery replays WAL forward from the last checkpoint’s redo point. Internalize “make the log durable, ship the log, replay the log” and replication, PITR, and WAL-disk incidents all become the same story told three ways.