Correlated Column Statistics Missing

Applies to PostgreSQL 13–17 Last reviewed May 2026 Grounded in source

Estimated investigation4 min

Investigation Path

Scenario

A SaaS platform stores customer addresses with city and state columns. A report query filters on both (WHERE city = 'Austin' AND state = 'TX') and returns 50,000 rows, but EXPLAIN shows the planner estimated only 12 rows. The plan uses a Nested Loop that scans millions of rows unnecessarily. No one has ever run CREATE STATISTICS on this table.

How to Identify

Conditions:

Queries filtering on two or more correlated columns produce severely wrong row estimates
EXPLAIN shows rows= while EXPLAIN ANALYZE shows actual rows=
Under-estimation leads to the planner choosing Nested Loop or wrong join strategy
SELECT * FROM pg_statistic_ext returns no rows for the table in question
Per-column pg_stats entries for each column individually look reasonable, but the planner multiplies their selectivities independently (independence assumption)

Analysis Steps

-- 1. Run EXPLAIN ANALYZE to compare estimated vs actual rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM customers WHERE city = 'Austin' AND state = 'TX';

-- 2. Check whether any extended statistics objects exist
SELECT stxname, stxrelid::regclass AS table_name,
       stxkeys, stxkind
FROM   pg_statistic_ext
ORDER  BY stxrelid::regclass::text;

-- 3. Inspect per-column ndistinct values — planner multiplies them independently
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'customers'
  AND  attname IN ('city', 'state');

-- 4. Verify the independence assumption error
--    Planner estimate ≈ (rows_for_city / total) * (rows_for_state / total) * total
--    If city and state are correlated, this massively underestimates

-- 5. Check most common values for each column separately
SELECT attname, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'customers'
  AND  attname IN ('city', 'state');

Pitfalls

Extended statistics require PostgreSQL 10+; CREATE STATISTICS syntax for MCV requires PG 12+
Creating too many extended statistics objects slows ANALYZE — focus on column pairs that appear together in WHERE clauses
Functional dependencies (dependencies) only help when one column functionally determines the other (e.g., zip → city); use ndistinct or mcv for non-deterministic correlations
CREATE STATISTICS collects stats but does NOT automatically trigger ANALYZE — you must run ANALYZE after
Extended stats do not apply to expressions, only to plain column references

Resolution Approach

The planner assumes column selectivities are statistically independent. When columns are correlated (city implies state), the independence assumption causes severe under-estimation. The fix is to create extended statistics that capture the joint distribution:

This is a Pro lesson

Get every Learning Pathway and cookbook recipe — grounded in PostgreSQL source code, with diagnostics, fixes, and prevention for each topic.

Continue this lesson to learn:

Mitigation Actions

All 36 Learning Pathway lessons
170+ cookbook recipes
Source-grounded diagnostics & fixes

Unlock everything — $24.99/month $199/year — save 35%

Secure checkout Cancel anytime Source-grounded

Already a member? Log in · New here? Create a free account

Correlated Column Statistics Missing

What This Is

Why It Matters

Next Step

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Career Impact

Related & next steps

More like this

Concepts on this page

Don't get paged twice for the same bug.

Scenario

Investigation Path

Scenario

How to Identify

Analysis Steps

Pitfalls

Resolution Approach

Related Knowledge

Related Interview Questions

Career Impact

Related & next steps

lan More like this

menu_book Concepts on this page

Don't get paged twice for the same bug.

More like this

Concepts on this page